RajeshThallam · March 23, 2016 06:37
diff --git a/job-posts-indexing b/job-posts-indexing
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Job Fiction - Indexing Jobs Data to Build a Training Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objective\n",
    "\n",
    "Objective of this notebook is to build a training model based on JOBFICTION database, a collection of job posts, job titles, company, location, job post URL acquired from Indeed Web Services API. Using the training model, we will be able to predict right job title based on the job descriptions passed to the model. Output from the training model would include - a corpus based on vector space model, key words and phrases, skill identifiers, predicted job titles and corresponding scores. All the results will be persisted and updated with the new jobs being collected.\n",
    "\n",
    "Based on the input from job seekers i.e. job descriptions submitted we will able to determine \n",
    "\n",
    "- Job titles closest to the job description or keywords submitted (based on the weights associated)\n",
    "- Recommended job posts\n",
    "- Keywords to search for the right job posts\n",
    "\n",
    "The first part of this notebook will explore how jobs in the JOBFICTION database can be classified. \n",
    "\n",
    "**Why do we have to classify the job posts?**\n",
    "\n",
    "A `truck driver` job post is way different from a `database administrator` job post. With the help of clustering algorithms we categorize similar jobs into same cluster based purely on the job description. Similar to movie genres this classifier is expected to create job categories based on similarity of job descriptions. We can then study the job titles under the same cluster to see how true clusters. Since there is no training data set available we resort to unsupervised clustering and the challenge is to define the number of clusters.\n",
    "\n",
    "We focus only on the data related job posts i.e. job posts with the word \"data\" in either job title or job description.\n",
    "\n",
    "## Approach\n",
    "\n",
    "- Export job descriptions, job title, company and job id from JOBFICTION database\n",
    "- Remove stop words\n",
    "- Tokenize and stem each job description\n",
    "- Transforming the corpus into vector space using tf-idf\n",
    "- Clustering the documents using the k-means algorithm \n",
    "- Plot the clusters\n",
    "- Using multidimensional scaling to reduce dimensionality within the corpus (LSI)\n",
    "- Topic modeling using Latent Dirichlet Allocation (LDA)\n",
    "- Named entity recognition against occupation skills and title taxonomies to identify skills\n",
    "\n",
    "**(Future Work)**\n",
    "\n",
    "- Hierarchical clustering on the corpus using [Ward clustering](http://en.wikipedia.org/wiki/Ward%27s_method)\n",
    "- Plot the clusters with hierarchial clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "from nltk.stem.snowball import SnowballStemmer\n",
    "from stop_words import get_stop_words\n",
    "from nltk.corpus import stopwords\n",
    "from gensim import corpora, models, similarities\n",
    "from sklearn.cluster import KMeans, MiniBatchKMeans\n",
    "from collections import Counter\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from wordcloud import WordCloud\n",
    "import logging\n",
    "import random\n",
    "import gensim\n",
    "import nltk\n",
    "import re\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "DATA_DIR = os.path.join(\"/home\", \"rt\", \"wrk\", \"jobs\", \"data\")\n",
    "MODEL_DIR = os.path.join(\"/home\", \"rt\", \"wrk\", \"jobs\", \"models\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Export data from JOBFICTION database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's extract jobs from JOBFICTION database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the jobs table, job description is an array of sentences. In order to export job description, this mongo javasript will be run to combine array elements as a string. For traceback we will add __id field to every record."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting export_jobs_with_title.js\n"
     ]
    }
   ],
   "source": [
    "%%writefile export_jobs_with_title.js\n",
    "db.jobs.find({\"summary\": /data/}, { _id: 1, jobtitle: 1, company: 1, summary: 1}).forEach( function (x) \n",
    "    {     \n",
    "        var jobdesc = '';\n",
    "        var s = ''\n",
    "        x.summary.forEach( function (y) { \n",
    "            s = y.replace(new RegExp('\\r?\\n','g'), ' ').replace(new RegExp('[|]','g'), '');\n",
    "            jobdesc += s + ' '; \n",
    "        });     \n",
    "        print(x._id + \"|\" + x.jobtitle + \"|\" + x.company + \"|\" + jobdesc);\n",
    "    });"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: cannot create directory ‘./data’: File exists\r\n"
     ]
    }
   ],
   "source": [
    "!mkdir ./data ./models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Run export script to dump data to text file **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r\n",
      "real\t4m54.169s\r\n",
      "user\t0m46.439s\r\n",
      "sys\t0m3.300s\r\n"
     ]
    }
   ],
   "source": [
    "!time mongo JOBFICTION --quiet export_jobs_with_title.js > ./data/export_jobs_w_title.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "144554 ./data/export_jobs_w_title.txt\n",
      "indeed_6ed966da9f33ffc1|Associate|Potbelly Sandwich Shop|Presidential Towers!!!!!! A Potbelly Associateâs job is to make our customers really happy. Since they are the primary point of customer contact, it is up to them to provide our customers and excellent experience by providing fast, friendly and efficient service and by delivering a quality and consistent product every time, in a clean and inviting environment. Essential ï§ Demonstrates and reinforces Potbellyâs Behaviors and Valuesâ Integrity, Food Loving, Teamwork, Accountability, Positive Energy, Coaching, Delivering Results through Execution, Building and Inspiring Teams, Creating Potbelly âFansâ-- through all interactions. ï§ Ability to discuss Potbelly history with others. ï§ Prepare quality finished products (sandwiches, salads, soups, cookies, ice cream, etc.) efficiently per Potbelly recipe manual standards. ï§ Comply with health and safety standards for food, cleanliness and safety of shop. ï§ Maintain personal hygiene standards, including wearing clean Potbelly uniform. ï§ Comply with established food safety requirements and practices. ï§ Comply with shop security and safety standards. ï§ Be speedy and accurate in fulfilling orders. ï§ Handle raw and finished waste according to established procedures. ï§ Make customers really happy. ï§ Engage in friendly conversation with customers in line. ï§ Act with a sense of urgency toward all customers in the shop. Other Key Functions ï§ Restock food line, chips and cooler. ï§ ï§ Work multiple stations (load, dress, shakes, cash, prep, front) as directed by Manager. ï§ Deliver catering orders as detailed in the Catering Driver and Delivery Agreement. ï§ Clean tables, counters, floors, bathrooms, kitchen and utensils; take out trash. ï§ Operate cash register: handle, balance and follow all cash handling procedures. ï§ Effectively handle customer complaints/issues. ï§ Take catering and delivery orders over the phone. 
ï§ PHYSICAL FUNCTIONS ï§ Ability to stand/walk a minimum of 3 hours or as needed. ï§ Must be able to exert well-paced and frequent mobility for periods of up to 3 hours or as needed. ï§ Be able to lift up to 10 pounds frequently. ï§ Will frequently reach, feel, bend, stoop, carry, finely manipulate and key in data. ï§ Able to work in both warm and cool environments, indoors (95%) and outdoors (5%). ï§ Must be able to tolerate higher levels of noise from music, customer and employee traffic. ï§ Must be able to tolerate potential allergens: peanut products, egg, dairy, gluten, soy, seafood and shellfish. EXPERIENCE, EDUCATION AND BEHAVIORS ï§ Must represent Potbelly Advantage and Our Values. ï§ Must be at least 16 years of age ï§ For Illinois employees, all employees are required to become food safety certified within 30 days of employment. Failure to do so will result in termination of employment. ï§ Must be friendly and customer service-oriented. ï§ Strong verbal communication skills. ï§ Must possess neat and clean hygiene. ï§ Ability to handle a knife confidently. ï§ Must be able to work in a fast-paced environment and have a sense ofï§ Ability to work as a team-player. ï· Ability to comprehend and communicate in English via verbal and written communication, such that employee can perform his or her job responsibilities. ï§ Must demonstrate leadership behaviors and values that align with Potbelly urgency. Potbelly.Com/Careers Job Type: Part-time Local candidates only: Chicago, IL 60661 Required education: High school or equivalent \n"
     ]
    }
   ],
   "source": [
    "!wc -l ./data/export_jobs_w_title.txt\n",
    "!head -1 ./data/export_jobs_w_title.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create training data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will export random 10K job descriptions as training data set. We will use unsupervised clustering to see how similar job descriptions are. based on clusters we can do topic modeling with LDA for each cluster. We can keep updating the model with new job posts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** *Below sort to be optimized by randomized only job ids instead of entire text.* **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sort: write failed: standard output: Broken pipe\n",
      "sort: write error\n",
      "\n",
      "real\t8m32.986s\n",
      "user\t8m30.838s\n",
      "sys\t0m2.053s\n"
     ]
    }
   ],
   "source": [
    "!time sort -t'|' -k1 -R ./data/export_jobs_w_title.txt | head -10000 > ./data/train_w_complete_text.txt"
   ]
  },
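  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `sort -R` pipeline above shuffles every full line before `head` keeps the first 10,000. As a hedged alternative, the sketch below draws a uniform sample in a single pass using reservoir sampling. It assumes Python 3 and one record per line; `sample_lines` is a hypothetical helper and not part of the original pipeline.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def sample_lines(lines, k, seed=None):\n",
    "    # Reservoir sampling (Algorithm R): keep k items chosen uniformly\n",
    "    # at random from an iterable of unknown length, in a single pass.\n",
    "    rng = random.Random(seed)\n",
    "    reservoir = []\n",
    "    for i, line in enumerate(lines):\n",
    "        if i < k:\n",
    "            reservoir.append(line)\n",
    "        else:\n",
    "            j = rng.randint(0, i)  # inclusive on both ends\n",
    "            if j < k:\n",
    "                reservoir[j] = line\n",
    "    return reservoir\n",
    "\n",
    "# usage sketch: sample 3 of 10 made-up records\n",
    "records = ['job_%d|title|company|text' % n for n in range(10)]\n",
    "picked = sample_lines(records, 3, seed=42)\n",
    "print(len(picked))\n",
    "```\n",
    "\n",
    "Passing an open file handle as `lines` would sample records without loading the whole export into memory."
   ]
  },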
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m0.570s\n",
      "user\t0m0.068s\n",
      "sys\t0m0.021s\n",
      "\n",
      "real\t0m2.101s\n",
      "user\t0m1.167s\n",
      "sys\t0m0.125s\n"
     ]
    }
   ],
   "source": [
    "!time awk -F'|' 'BEGIN{OFS=\"|\"}{print $1, $2, $3}' ./data/train_w_complete_text.txt > ./data/train_labels.txt\n",
    "!time awk -F'|' 'BEGIN{OFS=\"|\"}{print $4}' ./data/train_w_complete_text.txt > ./data/train.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "indeed_6d13e1749c444e23|Financial Examiner (EL)|GA Dept of Banking & Finance\r\n",
      "indeed_6d16914061219ee4|Analytics Payer/Provider Healthcare Analytics Manager|PRICE WATERHOUSE COOPERS\r\n",
      "indeed_50c9ebbfb19f9ed7|Aircraft Maintenance Analyst|Ronkonkoma, NY\r\n",
      "indeed_6d1fbfcd14cf79e9|Operations Center Representative - All Shifts|Ascent LLC.\r\n",
      "indeed_9a61d5c6de9dec4b|Administrator, Payroll|Community Action Project\r\n",
      "indeed_53c5e81c18aa4202|Project Coordinator/Data Analyst|The Fund for Public Health in New York, Inc.\r\n",
      "indeed_bf4b755eadef6b10|Plant Manager|IEC Holden Inc.\r\n",
      "indeed_e5ee1725b888eeb0|IT Infrastructure & Security Manager|Collibra\r\n",
      "indeed_08b4c32dcb730ba2|Material Control Specialist l|PRIMUS\r\n",
      "indeed_3aede0ed8048b044|Licensed Financial Advisor|Scient Federal Credit Union\r\n"
     ]
    }
   ],
   "source": [
    "!head ./data/train_labels.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!tail -1 ./data/export_jobs_w_title.txt > ./data/test_w_complete_text.txt\n",
    "!awk -F'|' 'BEGIN{OFS=\"|\"}{print $1, $2, $3}' ./data/test_w_complete_text.txt > ./data/test_labels.txt\n",
    "!awk -F'|' 'BEGIN{OFS=\"|\"}{print $4}' ./data/test_w_complete_text.txt > ./data/test.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!head -2 ./data/train.txt | tail -1 > ./data/sample.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Cleansing Data - Stop words, Tokenizing and Stemming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Failing to cleanse and normalize the data properly can decrease the overall effectiveness of the model. Let's define few functions before we take off"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# replace forward and back slash, hyphen, underscores and other characters\n",
    "def preprocess(text):\n",
    "    clean = text\n",
    "    clean = re.sub(\"[/_-]\", \" \", clean)\n",
    "    clean = re.sub(\"[^a-zA-Z.+3]\", \" \", clean) # get rid of any terms that aren't words\n",
    "    return clean"
   ]
  },
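  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the two substitutions in `preprocess` keep and drop, here is a small self-contained sketch. It assumes Python 3, and the sample string is made up for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def preprocess(text):\n",
    "    # same two substitutions as the cell above\n",
    "    clean = re.sub('[/_-]', ' ', text)\n",
    "    clean = re.sub('[^a-zA-Z.+3]', ' ', clean)\n",
    "    return clean\n",
    "\n",
    "sample = 'C++/Python-3 dev, 5 yrs (SQL_Server)'\n",
    "# split() collapses the runs of spaces left behind by the substitutions\n",
    "print(preprocess(sample).split())\n",
    "```\n",
    "\n",
    "Note that every digit other than 3 is removed, so a token such as `5` disappears while `3` survives."
   ]
  },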
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define a tokenizer and stemmer to returns the set of stems in the text passed\n",
    "\n",
    "def tokenize_and_stem(text):\n",
    "    # tokenize by sentence, then by word to catch any punctuations\n",
    "    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n",
    "    filtered_tokens = []\n",
    "    \n",
    "    # remove stop words from tokens\n",
    "    en_stop = set(get_stop_words('en') + stopwords.words(\"english\"))\n",
    "    stopped_tokens = [i for i in tokens if not i in en_stop]\n",
    "    \n",
    "    # filter out tokens not containing alphanumeric\n",
    "    for token in stopped_tokens:\n",
    "        if re.search('[a-zA-Z]', token):\n",
    "            filtered_tokens.append(token)\n",
    "\n",
    "    stems = [stemmer.stem(t) for t in filtered_tokens]\n",
    "\n",
    "    return stems\n",
    "\n",
    "def tokenize_only(text):\n",
    "    # tokenize by sentence, then by word to catch any punctuations\n",
    "    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n",
    "    filtered_tokens = []\n",
    "    \n",
    "    # remove stop words from tokens\n",
    "    en_stop = set(get_stop_words('en') + stopwords.words(\"english\"))\n",
    "    stopped_tokens = [i for i in tokens if not i in en_stop]\n",
    "    \n",
    "    # filter out tokens not containing alphanumeric\n",
    "    for token in stopped_tokens:\n",
    "        if re.search('[a-zA-Z]', token):\n",
    "            filtered_tokens.append(token)\n",
    "    \n",
    "    return filtered_tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# create p_stemmer of class SnowballStemmer\n",
    "stemmer = SnowballStemmer(\"english\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# compile training docs into a list\n",
    "train = [ preprocess(line.decode('unicode_escape').encode('ascii', 'ignore')) for line in open(os.path.join(DATA_DIR, 'train.txt'), 'r') ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# compile training labels for tracking and debugging purposes only\n",
    "train_labels = [ line.strip('\\n').split('|') for line in open(os.path.join(DATA_DIR, 'train_labels.txt'), 'r') ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['indeed_6d13e1749c444e23',\n",
       " 'Financial Examiner (EL)',\n",
       " 'GA Dept of Banking & Finance']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_labels[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating persistent files with words (i)  tokenized and stemmed and (ii) tokenized separetely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "FILE_STEM = os.path.join(DATA_DIR, 'train_stem.txt')\n",
    "FILE_TOKEN = os.path.join(DATA_DIR, 'train_token.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** Calling tokenizer and stemmer functions on the training data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f_stem = open(FILE_STEM, 'w')\n",
    "f_token = open(FILE_TOKEN, 'w')\n",
    "\n",
    "for jobdesc in train:\n",
    "    stemmed = tokenize_and_stem(jobdesc) \n",
    "    f_stem.write(' '.join(stemmed).encode('utf-8').strip() + '\\n')\n",
    "    \n",
    "    tokenized = tokenize_only(jobdesc)\n",
    "    f_token.write(' '.join(tokenized).encode('utf-8').strip() + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Bag-of-Words (BoW) Corpus & Dictionary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating Dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
      "Wall time: 8.82 µs\n",
      "Dictionary(47674 unique tokens: [u'fawn', u'nordisk', u'raining', u'environments.investment', u'prologistix']...)\n"
     ]
    }
   ],
   "source": [
    "%time\n",
    "dictionary = corpora.Dictionary([line.lower().split() for line in open(FILE_TOKEN)])\n",
    "dictionary.compactify()\n",
    "dictionary.save(os.path.join(MODEL_DIR, \"train_jobs.dict\"))\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** For scalability reason, using iterator to stream job description one by one instead of reading all jobs at a time in memory**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each document in the tokenized file is converted to bag-of-words model before storing as a corpus"
   ]
  },
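  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what a bag-of-words vector looks like, the sketch below maps token lists to sorted `(token_id, count)` pairs using only the standard library. It assumes Python 3; `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not match the ones gensim's `Dictionary` produces.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def build_vocab(docs):\n",
    "    # assign an integer id to each distinct token, in first-seen order\n",
    "    vocab = {}\n",
    "    for tokens in docs:\n",
    "        for tok in tokens:\n",
    "            vocab.setdefault(tok, len(vocab))\n",
    "    return vocab\n",
    "\n",
    "def to_bow(tokens, vocab):\n",
    "    # sparse vector: sorted (token_id, count) pairs; unknown tokens are skipped\n",
    "    counts = Counter(vocab[t] for t in tokens if t in vocab)\n",
    "    return sorted(counts.items())\n",
    "\n",
    "docs = [['data', 'analyst', 'sql'], ['data', 'data', 'engineer']]\n",
    "vocab = build_vocab(docs)\n",
    "print(to_bow(docs[1], vocab))\n",
    "```\n",
    "\n",
    "Only tokens that occur in a document appear in its vector, which is why the serialized corpus stays sparse."
   ]
  },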
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class jobCorpus(object):\n",
    "    def __iter__(self):\n",
    "        for line in open(FILE_TOKEN):\n",
    "            # assume there's one document per line, tokens separated by whitespace\n",
    "            yield dictionary.doc2bow(line.lower().split())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "jobs_corpus = jobCorpus()\n",
    "corpora.MmCorpus.serialize(os.path.join(MODEL_DIR, \"train_jobs.mm\"), jobs_corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MmCorpus(10000 documents, 47674 features, 2170358 non-zero entries)\n"
     ]
    }
   ],
   "source": [
    "corpus = corpora.MmCorpus(os.path.join(MODEL_DIR, \"train_jobs.mm\"))\n",
    "print corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Dimensionality Reduction using Latent Semantic Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Since we do not know how many topics this corpus should yield so we decided to compute this by reducing the features to n = 10 dimensions, then clustering the points for different values of K (number of clusters) to find an optimum value. Gensim offers various transforms that allow us to project the vectors in a corpus to a different coordinate space. One such transform is the Latent Semantic Indexing (LSI) transform, which we use to project the original data to 50D."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "MAX_LSI_TOPICS = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 56s, sys: 5.14 s, total: 2min 2s\n",
      "Wall time: 2min 15s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "dictionary = corpora.Dictionary.load(os.path.join(MODEL_DIR, \"train_jobs.dict\"))\n",
    "corpus = corpora.MmCorpus(os.path.join(MODEL_DIR, \"train_jobs.mm\"))\n",
    "\n",
    "tfidf = models.TfidfModel(corpus, normalize=True)\n",
    "corpus_tfidf = tfidf[corpus]\n",
    "\n",
    "# reduce the vector space by projecting  to 10 dimensions\n",
    "lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics = MAX_LSI_TOPICS)"
   ]
  },
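  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition about the tf-idf step, here is a minimal standard-library sketch. It assumes the common weighting of term count times log2(N / document frequency) followed by scaling each document to unit Euclidean length; the library's exact defaults may differ, and `tfidf` is a hypothetical helper written for illustration (Python 3).\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def tfidf(bow_docs):\n",
    "    # bow_docs: list of documents, each a list of (term_id, count) pairs\n",
    "    n_docs = len(bow_docs)\n",
    "    df = {}\n",
    "    for doc in bow_docs:\n",
    "        for term_id, _ in doc:\n",
    "            df[term_id] = df.get(term_id, 0) + 1\n",
    "    result = []\n",
    "    for doc in bow_docs:\n",
    "        # weight = term count * log2(N / document frequency)\n",
    "        vec = [(t, tf * math.log(float(n_docs) / df[t], 2)) for t, tf in doc]\n",
    "        # terms that occur in every document get weight 0 and are dropped\n",
    "        vec = [(t, w) for t, w in vec if w > 0.0]\n",
    "        norm = math.sqrt(sum(w * w for _, w in vec))\n",
    "        # scale each document vector to unit Euclidean length\n",
    "        result.append([(t, w / norm) for t, w in vec])\n",
    "    return result\n",
    "\n",
    "toy = [[(0, 1), (1, 1)], [(0, 2), (2, 1)]]\n",
    "print(tfidf(toy))\n",
    "```\n",
    "\n",
    "In the toy corpus, term 0 occurs in both documents, so it carries no weight and each document is left with its single distinguishing term."
   ]
  },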
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# write coordinates to file\n",
    "fcoords = open(os.path.join(MODEL_DIR, \"train_jobs_lsi_coords.csv\"), 'wb')\n",
    "for vector in lsi[corpus]:\n",
    "    if len(vector) != MAX_LSI_TOPICS:\n",
    "        continue\n",
    "    v = '\\t'.join([ \"{:6.6f}\".format(x[1]) for x in vector ])\n",
    "    fcoords.write(v + '\\n')\n",
    "fcoords.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10000 ./models/train_jobs_lsi_coords.csv\n",
      "5.612125\t-0.142342\t-0.005066\t1.373977\t-4.532514\t4.445029\t4.274249\t1.326633\t0.668706\t-1.276696\n",
      "12.383553\t5.060576\t-2.666577\t0.423928\t3.141846\t1.353790\t0.326339\t0.964861\t-0.058433\t0.539940\n"
     ]
    }
   ],
   "source": [
    "!wc -l ./models/train_jobs_lsi_coords.csv\n",
    "!head -2 ./models/train_jobs_lsi_coords.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. K-Means Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we clustered the points in the reduced dimension LSI space using K-Means, varying the number of clusters (K) from 1 to 50. The objective function used is the Inertia of the cluster, [defined](http://scikit-learn.org/stable/modules/clustering.html#k-means) as the sum of squared differences of each point to its cluster centroid. This value is fed from Scikit-Learn K-Means algorithm. \n",
    "\n",
    "**Reference:**\n",
    "\n",
    "- [Stackoverflow](http://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means)\n",
    "- [Data science central post by Vincent Granville](http://www.analyticbridge.com/profiles/blogs/identifying-the-number-of-clusters-finally-a-solution)"
   ]
  },
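  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inertia described above can be sketched in a few lines of plain Python. This is only an illustration on made-up 2-D points (Python 3 assumed); `inertia` is a hypothetical helper, and the notebook itself relies on the value reported by scikit-learn.\n",
    "\n",
    "```python\n",
    "def inertia(points, centroids, labels):\n",
    "    # sum of squared Euclidean distances from each point to its assigned centroid\n",
    "    total = 0.0\n",
    "    for p, lab in zip(points, labels):\n",
    "        c = centroids[lab]\n",
    "        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))\n",
    "    return total\n",
    "\n",
    "points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]\n",
    "\n",
    "# K = 1: a single centroid at the overall mean\n",
    "one = inertia(points, [(5.0, 1.0)], [0, 0, 0, 0])\n",
    "# K = 2: one centroid per visible group\n",
    "two = inertia(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])\n",
    "print(one, two)\n",
    "```\n",
    "\n",
    "Inertia can only shrink as K grows, so the useful signal is where the decrease levels off rather than where the value is smallest."
   ]
  },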
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Determine Number of Topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "MAX_K = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = np.loadtxt(os.path.join(MODEL_DIR, \"train_jobs_lsi_coords.csv\"), delimiter=\"\\t\")\n",
    "ks = range(1, MAX_K + 1)\n",
    "\n",
    "inertias = np.zeros(MAX_K)\n",
    "diff = np.zeros(MAX_K)\n",
    "diff2 = np.zeros(MAX_K)\n",
    "diff3 = np.zeros(MAX_K)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAaMAAAEPCAYAAADvS6thAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8FuW99/HPLwshQFgCZZO1Fiux1qNWrdUeYq2ASlGJ\nCwqKlkrPUQoWa0k4VuFUcaviVpRaC7gvQF1OrQRr86BPH2tPW3u04CFYrSFIUDAQkLDdv+ePmTsM\nIYkhJJks3/frlVfmvu6Zua97Xpov1zW/mTF3R0REJE4pcXdAREREYSQiIrFTGImISOwURiIiEjuF\nkYiIxE5hJCIisWuyMDKzX5lZmZm9HWm7w8xWm9nfzGyZmXWLvFdgZsVm9q6ZjYy0H29mb4fv3RNp\nzzCzp8P2N8xscOS9SWa2Jvy5LNI+1Mz+GG7zlJmlN9X3FxGR+mvKkdFCYHS1tkLgKHc/BlgDFACY\nWQ5wEZATbjPfzCzc5gFgsrsPA4aZWXKfk4FNYfs84LZwX9nADcCJ4c+NkdC7Dbgz3ObTcB8iIhKz\nJgsjd3+N4A9+tG2FuyfCl38EBoTL5wBPuvtud/8AWAucZGb9gCx3fzNc7xHg3HB5LLA4XF4KnB4u\njwIK3b3c3cuBFcCZYbidBiwJ11sc2ZeIiMQoznNG3wVeCpf7A+si760DDquhvTRsJ/xdAuDue4At\nZtazjn1lA+WRMIzuS0REYhRLGJnZfwC73P2JZvpI3fNIRKQFS2vuDzSzy4Gz2DetBsEoZWDk9QCC\nEU0p+6byou3JbQYB680sDejm7pvMrBTIjWwzEHgV2Ax0N7OUcHQ0INxHTX1UeImINIC72+evdaBm\nHRmFxQfXAee4e2XkrReA8WbWwcyGAsOAN919A7DVzE4Kz/lcCjwf2WZSuHw+8LtwuRAYaWbdzawH\ncAaw3IM7wv4euCBcbxLwXG19dXf9uHPjjTfG3oeW8qNjoWOhY1H3z6FospGRmT0JjAB6mVkJcCNB\n9VwHYEVYLPf/3P0qd19lZs8Aq4A9wFW+75tdBSwCMoGX3P3lsP1h4FEzKwY2AeMB3H2zmf0U+FO4\n3hwPChkAZgJPmdlNwF/CfYiISMyaLIzc/eIamn9Vx/pzgbk1tP8ZOLqG9p3AhbXsayFBaXn19veB\nk2rvtYiIxEF3YJA65ebmxt2FFkPHYh8di310LBqHHeo8X1tkZq7jIiJycMwMbw0FDCIiIjVRGImI\nSOwURiIiEjuFkYiIxE5hJCIisVMYiYhI7BRGIiISO4WRiIjETmEkIiKxUxiJiEjsFEYiIhI7hZGI\niMROYSQiIrFTGImISOwURiIiEjuFkYiIxE5hJCIisVMYiYhI7BRGIiISO4VRHdyd/Pzbcfe4uyIi\n0qYpjOqwdOly5s//iGXLCuPuiohIm2b6V/+BzMxzcs5m585jeO+9mxg27HrS0//GtGnj+f73J8bd\nPRGRFsnMcHdryLYaGdVi9uyr2b49ARiVlQnmzJnKlCkT4u6WiEiblBZ3B1oqM2PbtkpSUmZQXp7A\nzDBrUOCLiMjn0MioFsXFJSxcOJqUlDv5xS/OpLi4JO4uiYi0WTpnVAMz8+RxGTQIVq6EIUPi7ZOI\nSEvXIs8ZmdmvzKzMzN6OtGWb2QozW2NmhWbWPfJegZkVm9m7ZjYy0n68mb0dvndPpD3DzJ4O298w\ns8GR9yaFn7HGzC6LtA81sz+G2zxlZumf9z3694f16w/1aIiISF2acppuITC6Wls+sMLdjwB+F77G\nzHKAi4CccJv5tu8EzQPAZHcfBgwzs+Q+JwObwvZ5wG3hvrKBG4ATw58bzaxbuM1twJ3hNp+G+6hT\n//7w0UcH+9VFRORgNFkYuftrBH/wo8YCi8PlxcC54fI5wJPuvtvdPwDWAieZWT8gy93fDNd7JLJN\ndF9LgdPD5VFAobuXu3s5sAI4Mwy304
AlNXx+rfr108hIRKSpNXcBQx93LwuXy4A+4XJ/YF1kvXXA\nYTW0l4bthL9LANx9D7DFzHrWsa9soNzdEzXsq1aaphMRaXqxVdOFFQLNVT3R4M9RGImINL3mvs6o\nzMz6uvuGcApuY9heCgyMrDeAYERTGi5Xb09uMwhYb2ZpQDd332RmpUBuZJuBwKvAZqC7maWEo6MB\n4T5qNHv2bADWroV3382ttksRESkqKqKoqKhR9tWkpd1mNgR40d2PDl/fTlB0cJuZ5QPd3T0/LGB4\ngqDg4DDgFeBL7u5m9kdgGvAm8BvgXnd/2cyuAo529383s/HAue4+Pixg+G/gOMCAPwPHuXu5mT0D\nLHX3p83sQeAtd3+whn5XlXb/z//AJZfAO+801VESEWkbDqW0u8nCyMyeBEYAvQjOD90APA88QzCi\n+QC4MCwywMxmAd8F9gDT3X152H48sAjIBF5y92lhewbwKHAssAkYHxY/YGZXALPCrtzk7ovD9qHA\nUwTnj/4CTHT33TX0vSqMPvkEjjgCNm9upAMjItJGtcgwas2iYeQOHTvCli3BbxERqVmLvOi1rTCD\nvn11rZGISFNSGNWDKupERJqWwqge+vXTyEhEpCkpjOpBIyMRkaalMKoHhZGISNNSGNWDbpYqItK0\nFEb1oJuliog0LYVRPWiaTkSkaSmM6kFhJCLStBRG9ZCdDZ99Bjt2xN0TEZG2SWFUD2a61khEpCkp\njOpJU3UiIk1HYVRPGhmJiDQdhVE9aWQkItJ0FEb1pDASEWk6CqN60jSdiEjTURjVk0ZGIiJNR2FU\nTwojEZGmozCqp379oLTUyc+/HT2qXUSkcSmM6im4C8Ny5s//iGXLCuPujohIm6IwqocFCx7jK18Z\nA7xGRcVdFBSs5KijxrBgwWNxd01EpE1Ii7sDrcGUKRPIzu7JpZeuZM8eo7Iywdy5U8nLGxV310RE\n2gSFUT2YGWbG3r2V9Os3g/LyRFWbiIgcOoVRPRUXl3DeeaMZNmwkxx1XSHFxSdxdEhFpM0yVYQcy\nM6/puCxeDCtWwGM6VSQicgAzw90bNGWkAoaDMHQovP9+3L0QEWl7FEYHYcgQ+OCDuHshItL2aJqu\nBrVN0+3dC506wZYt0LFjDB0TEWnBNE3XTFJTYcAA+PDDuHsiItK2xBJGZlZgZn83s7fN7AkzyzCz\nbDNbYWZrzKzQzLpXW7/YzN41s5GR9uPDfRSb2T2R9gwzezpsf8PMBkfemxR+xhozu+xg+z50qKbq\nREQaW7OHkZkNAa4EjnP3o4FUYDyQD6xw9yOA34WvMbMc4CIgBxgNzLd9F/g8AEx292HAMDMbHbZP\nBjaF7fOA28J9ZQM3ACeGPzdGQ68+hgxREYOISGOLY2S0FdgNdDKzNKATsB4YCywO11kMnBsunwM8\n6e673f0DYC1wkpn1A7Lc/c1wvUci20T3tRQ4PVweBRS6e7m7lwMrCAKu3lTEICLS+Jo9jNx9M3An\n8CFBCJW7+wqgj7uXhauVAX3C5f7Ausgu1gGH1dBeGrYT/i4JP28PsMXMetaxr3pTebeISONr9jsw\nmNnhwDXAEGAL8KyZTYyu4+5uZrGW+c2ePbtqOTc3l9zcXEAjIxGRpKKiIoqKihplX3HcDuhrwB/c\nfROAmS0DTgY2mFlfd98QTsFtDNcvBQZGth9AMKIpDZertye3GQSsD6cCu7n7JjMrBXIj2wwEXq2p\nk9EwitLISEQkEP2HOsCcOXMavK84zhm9C3zdzDLDQoRvA6uAF4FJ4TqTgOfC5ReA8WbWwcyGAsOA\nN919A7DVzE4K93Mp8Hxkm+S+zicoiAAoBEaaWXcz6wGcASw/mM737RtcZ/TZZwf3pUVEpHbNPjJy\n97+Z2SPAfwMJ4C/AL4As4Bkzmwx8AFwYrr/KzJ4hCKw9wFWRK1KvAhYBmcBL7v5y2P4w8KiZFQOb\nCK
r1cPfNZvZT4E/henPCQoZ6S0mBwYPhn/+E4cMP+uuLiEgNdAeGGtR2B4akUaNg+nQ466xm7JSI\nSAunOzA0M134KiLSuBRGDaALX0VEGpfCqAFU3i0i0rgURg2g8m4RkcalMGoAjYxERBqXwqgBevcO\nrjOqqIi7JyIibYPCqAHMYPBg55prbkel8SIih05h1ECZmct58smPWLasMO6uiIi0egqjg7RgwWMc\nddQY/vGP19ix4y4KClZy1FFjWLDgsbi7JiLSasVxo9RWbcqUCWRn9+T7318JGJWVCebOnUpe3qi4\nuyYi0mopjA6SmWFm7NpVSVraDMrLE1VtIiLSMJqma4Di4hIWLRrNF75wJ3PnnklxcUncXRIRadV0\no9QafN6NUpMmToQRI+DKK5uhUyIiLZxulBqTb30LXq3x0XwiInIwNDKqQX1HRu+/DyefDB99FFx7\nJCLSnmlkFJOhQyEzE959N+6eiIi0bgqjQ3TaaZqqExE5VAqjQ3TaafD738fdCxGR1q1e54zMbAxw\nFNARcAB3/8+m7Vp86nvOCGDdOviXf4GNGyFF0S4i7ViTnjMyswXAhcAPwqYLgcEN+bC2aMAA6NHD\n+d73dNNUEZGGqs+/5b/h7pcBm919DvB14MtN263WZciQ5TzxhG6aKiLSUPUJox3h78/M7DBgD9C3\n6brUeiRvmvrOO6+xc6dumioi0lD1uTfdf5lZD+AO4M9h20NN16XWI3nT1B/+MLhp6rZtCe69VzdN\nFRE5WJ8bRpFChaVm9hugo7uXN223WofkDVK3bq2kV68ZfPyxbpoqItIQtYaRmZ3u7r8zszzCCrrI\ne7j7sibvXStQXFzCwoWjOeOMkQwaVMjKlSXk5cXdKxGR1qXW0m4zm+PuN5rZIqqFEYC7X9HEfYvN\nwZR2R917LyxfDv/1X05BwR3ccst1GiWJSLtxKKXdn3udkZl90d3/8XltbUlDw2jnTjjySLjiipf5\n2c+Ws3DhaJ0/EpF2o6nvTbekhrZnG/Jhbd2iRY+xa9cYbrnlNSoqVF0nIlJftYaRmQ0Pzxd1N7Nx\nZpYX/r6c4E4MDWZm3c1siZmtNrNVZnaSmWWb2QozW2NmhWbWPbJ+gZkVm9m7ZjYy0n68mb0dvndP\npD3DzJ4O298ws8GR9yaFn7HGzC47lO9R3ZQpE5g372r27EmQfCT5nDlTmTJlQmN+jIhIm1PXyOgI\n4DtAt/D3mPD3ccChPk7uHuAldx8OfBV4F8gHVrj7EcDvwteYWQ5wEZADjAbm274TMQ8Ak919GDDM\nzEaH7ZOBTWH7POC2cF/ZwA3AieHPjdHQO1RmRkqKkZpaSdeuMygv36HqOhGReqg1jNz9eeB7wJ3u\nfkXkZ5q7/6GhH2hm3YBvuvuvws/Z4+5bgLHA4nC1xcC54fI5wJPuvtvdPwDWAieZWT8gy93fDNd7\nJLJNdF9LgdPD5VFAobuXh+XpKwgCrtEUF5fw4PyRXLzrq3yY9WvGXjwWOnWCnBxYvBgSicb8OBGR\nNqHOc0buvgc4r5E/cyjwsZktNLO/mNlDZtYZ6OPuZeE6ZUCfcLk/sC6y/TrgsBraS8N2wt8lke+w\nxcx61rGvRlMw+RwuXziXByuvoPv6D0nfvQt27IDVq+Hyy4PnlG/c2JgfKSLS6tXnDgyvm9n9wNPA\ndsAAd/e/HMJnHgdMdfc/mdndhFNySe7uZhbrXUdnz55dtZybm0tubu7nb7RxY/BMiVWr2EAffszt\n/DrM8vP4NXdwHX1ef33fcyd6926azouINIOioiKKiooaZV/1Ke0uoubrjE5r0Aea9QX+n7sPDV+f\nChQAXwROc/cN4RTc7939SDPLDz/v1nD9l4EbgX+G6wwP2y8G/tXd/z1cZ7a7v2FmacBH7v4FMxsP\n5Lr7v4XbLABedfenq/Xx4Eu7E4lg1PP66/ydHL7Fq2ysGtwFelPG
q3yLo1gFp54KK1fqeeUi0mY0\naWm3u+e6+2nVfxryYeH+NgAlZnZE2PRt4O/Ai8CksG0S8Fy4/AIw3sw6mNlQYBjwZrifrWElngGX\nAs9Htknu63yCggiAQmBkWM3XAzgDWN7Q77KfRx+F119nA31qDCKAjeF7ZfSG118PthERkXqNjPoC\nNwOHufvosLrtZHd/uMEfanYM8EugA/AecAWQCjwDDAI+AC5M3gPPzGYB3yW4Y/h0d18eth8PLAIy\nCarzpoXtGcCjwLHAJmB8WPyAmV0BzAq7cpO7Jwsdov07+JFRTg6sXs1lLOZR6q4Yv4zFLOZyGD4c\nVq06uM8REWmhmvoODC8DC4H/cPevmlk68Fd3/0pDPrA1aFAYdeoEO3aQxVa2kVXnql2ooIKuwTbb\ntx9CT0VEWo6mvgNDr/Ccyl4Ad99NMEIRERFpFPUJo21hWTQAZvZ1YEvTdamVGjIECKrmPs84whue\nD9bT20VEoH5hdC1BccEXzewPBOdipjVpr1qjmTMBuIPr6E1Zrav1pozb+XHwIj+/1vVERNqTzz1n\nBBCeJ/py+PJ/w6m6Nkul3SIiB69JCxjCDzgFGEJwwaoDuPsjDfnA1qChj5CIXvRaRm9+zO0sYxwQ\nTM3dzo/pw8ag8k4XvYpIG9PU1XSPEVyQ+hZhEQOAu/+gIR/YGjQ4jCAIpLy84Dqimpx6KixdqiAS\nkTanqcNoNZDT8L/Orc8hhRGAe3BB6623wj//CcC2noO5Mz2fG9deqqk5EWmTmjqMniW40HR9Qz6g\nNTrkMKrB7t3Qv7+Tl3cHDzygx5GLSNvT1NcZfQFYFT7w7sXw54WGfFh7lp4Oxx+/nEWLPmLZssK4\nuyMi0qLUZ2SUW1O7uxc1QX9ahMYeGS1Y8Bj33vsUFRXHUFJyE8OGXU96+t+YNm083//+xEb7HBGR\nOB3KyOhzHyHRlkOnuUyZMoHs7J5ce+1KwNi6NcH9908lL29U3F0TEWkRag0jM9tGDY+OCLm7d22a\nLrU9yUePl5dX0qvXDDZvTuhx5CIiEbWGkbt3ac6OtHXFxSUsXDiaY48dyTHHFLJ6dUncXRIRaTHq\nddFre9MU1XRRubkwbRqMG9dkHyEi0uyauppOGtmkSbD4gKcoiYi0XxoZ1aCpR0YVFTBwIKxZoxsx\niEjboZFRK5OVBWPHwuOPO/n5t6N/EIhIe/e5pd3SNCZNgsmTl7N580eccEKhyrxFpF3TyCgGCxY8\nxrRpYygtfY2KirsoKFjJUUeNYcGCx+LumohILDQyikHyItjvfW8lW7calZUJ5s7VRbAi0n4pjGKQ\nvOB1795KUlNnUF6ui2BFpH1TGMWkuLiExYtH87OfjSQ3t5DiYl0EKyLtl0q7a9DUpd1RCxbAK6/A\ns882y8eJiDSZJn/seHvTnGFUXg5DhsA//gHZ2c3ykSIiTULXGbVi3bvDqFHw9NNx90REJD4KoxZg\n0iRYtEgXwIpI+6Vpuho05zQdwJ490KvXy+zZs5zFi0erxFtEWiVN07ViCxY8xjHHjMHsNbZvv4v8\n/JXk5JzN6NGXaZQkIu1GbGFkZqlm9lczezF8nW1mK8xsjZkVmln3yLoFZlZsZu+a2chI+/Fm9nb4\n3j2R9gwzezpsf8PMBkfemxR+xhozu6y5vm9tpkyZwOzZV9O5cwIwPvwwwSmnnMof/pDNsmWFcXdP\nRKRZxDkymg6sYt/TZPOBFe5+BPC78DVmlgNcBOQAo4H5tu/q0AeAye4+DBhmZqPD9snAprB9HnBb\nuK9s4AbgxPDnxmjoxSF5sevWrZX07TuWXbtW8PDDZVRUzNNtgkSk3YgljMxsAHAW8EsgGSxjgeRT\nfhYD54bL5wBPuvtud/8AWAucZGb9gCx3fzNc75HINtF9LQVOD5dHAYXuXu7u5cAKgoCLVfIpsKWl\nzzFjRh6dOu0FjE2bEsyZM5Ur
r7xExQ0i0qbFNTKaB1wHJCJtfdy9LFwuA/qEy/2BdZH11gGH1dBe\nGrYT/i4BcPc9wBYz61nHvmJVUHAleXmjSElJ4eSTjyUlJY3DD5/Bp5/u4De/MZYtK2T+/I80bSci\nbVaz3w7IzMYAG939r2aWW9M67u5mFuswYPbs2VXLubm55ObmNsvnJkdJ48aNZMKEWTzyyH/y4ou5\n4d29r+eGG+5j2rTxfP/7E5ulPyIitSkqKqKoqKhR9hXHvem+AYw1s7OAjkBXM3sUKDOzvu6+IZyC\n2xiuXwoMjGw/gGBEUxouV29PbjMIWG9maUA3d99kZqVAbmSbgcCrNXUyGkbNqaDgyqrlxx+fy2mn\nvczVV68Egrt733zz1fz5z/+Du+vGqiISq+r/UJ8zZ06D99Xs03TuPsvdB7r7UGA88Kq7Xwq8AEwK\nV5sEPBcuvwCMN7MOZjYUGAa86e4bgK1mdlJY0HAp8Hxkm+S+zicoiAAoBEaaWXcz6wGcASxvsi97\niMyMHj2MjIxKOneewfr1O3jttbeYP3+DpuxEpE1pCdcZJafjbgXOMLM1wLfC17j7KuAZgsq73wJX\nRa5IvYqgCKIYWOvuL4ftDwM9zawYuIawMs/dNwM/Bf4EvAnMCQsZWqzi4hIWLRrN7bcfS0bGW8yf\nv04P5BORNkd3YKhBc9+BoT7cnWeffZkpU1ayZcst9O9fwD33jCAvb5Sm60SkRdAdGNoBMyMlxUgk\nKundewYffbSDLVugoOAOlXyLSKunMGpFkpV2GzbcSV7emVx77Sv8/Of7Sr7ddbNVEWmdNE1Xg5Y4\nTRe1YMFj3HvvU6xffwzl5TcxYMD1dO36N0455WieeqqShQt1s1URaX6apmtnkvezy8oK7mdXWrqa\n4uJtvPCCqbhBRFolhVErlLyfXXl5JTk5M+jceSBjxozi00+d5PVIc+ZMZcqUCXF3VUSkXhRGrVTy\n/NE779zJokVnkZm5iQ4dKklPn8Enn+yoCiwRkdZA54xq0NLPGdXkllse4ogjBjFo0EhOP72QKVM+\nJC3tU2655TqFkog0i0M5Z6QwqkFrDKOo++6DefNe5pNPlquYQUSajQoYpMqCBY/x4INj2LjxNSoq\n7uLHPw6KGR588NGqsm+VgItISxPHjVKlCU2ZMoHs7J7MmLGS7duN999PcOGFU+nSxZk//y+ccEIh\n7s78+R9xwgmFGjWJSIugaboatPZpuiVLXua7313OwIHG2rXFpKZWsGvXKezdeySpqfeRkjKc3bsX\nMWzY9aSn/02PpBCRRqFpOtlPtNLu8cev5txzj+cLX0gAE0lJOYLdu3sAxo4dCWbPvpr33y/VlJ2I\nxEojoxq09pFRddGR0j/+UQr0pkOHdLZtSzB5ch+eemqjCh1E5JCpmq6RtbUwSpZ9jxs3kokTfwzA\nN7/5VfLzH2br1qNwv79qyu4HP7iIDz74SCXhInLQFEaNrK2FUU3cnSVLgqfIfvzxLXTsWMDcuSMY\nMMCZPLmw6tHnBQV3KJhEpF50zkgOWvIODZWVlRx55Ax2736HH/3oViZPfr3q/naDBo3g7rs/ZNmy\nQpWDi0iTUhi1Y8lCh1Wr7uSpp65m3LjjMUsAj7N2bSGffXYcO3fed0AwiYg0Nk3T1aA9TNPVJFno\n0L8/vPdeKe792bv3a5jdQ6dOp7B9+7wazy0Bms4TEU3TSeNIjpRWr76L6dOPJS1tN8OG/Zm0tKEk\nEgDBRbQnnjiVzMxezJ8fPNhv6dLlVcvVaXpPROpDI6MatNeRUVS0Au+6627l/vvXcfjhGRQXF5OW\nVkFl5Sm4H0lKyn2YDWfv3uAi2rS0txg0qCe//e1izKxqtKXScZG2T9V0jUxhtL9oMC1dupxly1aw\ncmUHSkvnkpl5KYlEL3buvJtOnQo4//yu/PrXZVx8cSavv/42u3cfQ3HxTTUGlYi0LQqjRqYwql
v1\ni2jNetO16z8pK1uP2am4zyMt7T/Ys+d14F+Ae+nTp4AJE7ry0ENlLFx4pkZJIm2QzhlJs4rebmjc\nuEGcd15H1q9/jhkz8ujadS9g9OrlXHXVmWRlpZCVNZayshXcf38ZFRXz9Fh0ETmAwkgOWkHBleTl\njcLMePzxO3j88TtISUnh5JOPJZFIIydnBtu376C8fDMLF55JeflzXHNNHqmpQVB9/HGCG2/c/554\ntRU6qABCpH1QGEmjiY6YFi48k6OP/jJ5eaNISUnhlFOOJS0tjWHDZlBRsYOZM9/i5z/fUFWBF63I\niwZQ9Uo9hZNI26RzRjXQOaPGlyyC+OSTjfz0pw+zefNR7NhxP2YX4L6G1NQR7N17L4cffj07d77G\nRx9l0a/fdsxOoaTkJgYPvp7Onf/GKacczVNPVX5udZ67V137BLoOSqQ5qIChkSmMmk7ynnjXXruS\nkpJb6Ncvn9Gju/Kb32xl48avAPfQufMpbN9+F6mpk9i7NwP4JTCOtLTNdO36DTZvvvmA6jzYP3Ci\nJeXurvJykWagAgZpNZL3xCsvryQnZwbbtlWSnW3s2LGT4cP/TEbGUPbuBUihc2cnMzOTnJwf0qXL\nQPLyRrFzpwNGeXmCM888lT/8IXu/C2//7d9mcdRRY5g16zUqKo7joot+woUXPll1v73qhROa9hNp\nGZo9jMxsoJn93sz+bmbvmNm0sD3bzFaY2RozKzSz7pFtCsys2MzeNbORkfbjzezt8L17Iu0ZZvZ0\n2P6GmQ2OvDcp/Iw1ZnZZc31v2af6uaV33vkHCxeO5u9/v4upU4/FfTc5OTOorNzK1Vcfxjvv3MWi\nRWeRmrqJlJRKevQYy8cfr2DBgjIqKo7n4ot/wsUXB4Hz9NPG2rVbKS6uACaSmnoEnToFDxMsLU1w\nww37F07UdfcIEWk+zT5NZ2Z9gb7u/paZdQH+DJwLXAF84u63m9lMoIe755tZDvAEcAJwGPAKMMzd\n3czeBKa6+5tm9hJwr7u/bGZXAV9x96vM7CLgPHcfb2bZwJ+A48Pu/Bk43t3Lq/VR03QxiV5gu2xZ\nIcXFJeTnf2+/98477wx+8IPbeOih9ezefS8pKZeSktKLPXvuplu3Ar797a4sX17GoEEpVddBDR6c\nTnFxgmHD+lBaupGLL+7I66+/zc6dx/Dee59/UW70HJTOO4nUrFVN07n7Bnd/K1zeBqwmCJmxwOJw\ntcUEAQVwDvCku+929w+AtcBJZtYPyHL3N8P1HolsE93XUuD0cHkUUOju5WEArQBGN/63lIaKlo3n\n5Y2qCqKQPsI8AAAP10lEQVToeykpKZx22rF07JhGTs61dOiwm/T0veTkzCCR2EFGxmYWLTpzv+ug\nrrnmWPr2fYv33ltHRcVdPPaY8e67W3nvvQrA2LgxwYgR+6b9gDqr+kSkccV6zsjMhgDHAn8E+rh7\nWfhWGdAnXO4PrItsto4gvKq3l4bthL9LANx9D7DFzHrWsS9pZWq68LZ6SXn0OqgpUyYyb95M+vTp\nChidOjmXXx5clNuz51i2bElO+83j3/99JV/84hiuvHIW8+a9Rq9epzFlymtV551ycs5m9OjLarxG\nqrHPQemclrQXsYVROEW3FJju7hXR98I5Mv3fJ7Wq6cLbmkZTScnCiS1bgsKJnTt3UFkZXJS7cWNw\n94isrOCi3C1bVlNSso2HHzZ27Xqe7dsHsmVLGcm7lg8duv8IKjpqqu16Kag9tOoKHI3IpL1Ii+ND\nzSydIIgedffnwuYyM+vr7hvCKbiNYXspMDCy+QCCEU1puFy9PbnNIGC9maUB3dx9k5mVArmRbQYC\nr9bUx9mzZ1ct5+bmkpubW9Nq0ookR1PR81HJUu+TTz6Whx7aSE7ODD78cCBTppzAs89upaQkhS5d\nnM8+y6Rbt+9QVvYRK1acyu7d8zj//AuA64ARwIlceOFPSE
kZzp49iygouJ5rrrmZjz/+KiecUEhe\n3qiqYDnhhCCoalrOyxuFu3PmmZfx4YefsnnzMeGI7Hp+8pN79zunVdt5LJ3fkuZSVFREUVFR4+ws\n+S+z5voBjOD8zrxq7bcDM8PlfODWcDkHeAvoAAwF3mNf4cUfgZPCfb4EjA7brwIeCJfHA0+Fy9nA\nP4DuQI/kcg19dGlf5s79hS9Z8rInEglfsuRlv+SSaz0r6xrPyfmhd+jwHf/Rj27xvXv3+owZc71b\nt6kO7v37z/Srr77ZDztspkPCMzMneHr6dIdHHb7mmZnTHRLeqVOep6Qc7R07TnV4xM1OcLisatns\nMoeE9+o1y4cMOdsvvzzfO3SY7l26zPWMjHwH9/T0fB83bq5nZU33JUtednf3Z5/9rWdlXVPV75kz\nb/NEIrFfe12i24g0hvBvZ4OyIY5pulOAicBpZvbX8Gc0cCtwhpmtAb4VvsbdVwHPAKuA3wJXhV8a\ngtD5JVAMrHX3l8P2h4GeZlYMXEMQbrj7ZuCnBBV1bwJzvFolnbRP1QsnvvKVL1edk3riiavp2bPX\nAfffq6iopGNHY+vWneTkXIv7btLS9nLkkcH1UikpAEZa2pf45jfH06lTZ2AiXbocQbduPYCJZGUd\nQVZWUHpeUbGa0tJtLFpk7No1j4yM1ezevZmuXb/Nnj0reP754JzWhAkPkZr6VS6//DdUVNxFfn7w\nWPif/ez/MHDgaUyfHpzfmjkzuK7qwQcfrXFKULdakhaloSnWln/QyEhqUX0ENWrU96peX3LJj/yS\nS37kiUTCr712rmdkXOU5OT/0rKzpfu21c6tGWh07XuiZmVMPWO7SZZrPmDHXBw4MRkPduo3za6+d\n63v27NlvRJadPdNzc2/2rl1nVhuF7XWY6DDZwT0lJd+vvvq3/swzL1WNlJ599reekTHGBwwY4cOG\nzXJI+LBhszwn52y/8sr8Gkda1UdQdY2o6tpO2j4OYWQU+x/+lvijMJJDVd/Qii5Xnx6saUquergN\nH36NZ2Rc4J07T3dw79FjQhhu13h6+hhPTx/hHTvOcnjEU1KSU4J7PSVlX2jBeZ6WNsI7dy5wSPiX\nvjTLBwz4pmdkXF0VYNFpv9qmB6u/V9d6DQmtxl5PGp/CSGEkbUT1ELvllodqbI+GW3QUljy/lTx3\nNH78DM/Ozg/PXU3wLl2C0OrefYJ37BiEVpcu0/zSS+d6z575VSOtTp2mVwuwhKenB+e+ghFaMKJK\nhtaUKfmek3O2f+lLQfClp5/g6emXHbBeTYFW39Cqa72o+u4vSgHWOBRGCiNpx2oLMPf9R1TRKcFo\naEVHZMmRVrL4IjNzgnftGgRYr14z/etfv9k7dtw3PZiaGqyXllbg8E2HHzgkHCa42b5ijpSU5Hp5\nbna0p6dPrQq7lJQgtAYMmOX9+h0YWjVNK1YPt0Qi4aNGTfScnLP9i1+sfb36Fn3UNnKr7xRlXdpy\n8CmMFEYiNYoGVfUpwWhoRdeLjrSiAVZ9erBjxwu8e/cgqPr2zferrgqq/aLbVV+vd++ZfuGFN3vv\n3kEFYufOE8IpxiC0IAgtszyHoz01dWq1c2HBekHQJbxbt1k+YMDZPmFCvmdkTPdBg+a62b4RXlZW\nsF52dp6npx/t3bvvP6pLTT3LBwwY4YcfXvv5s9pGce71H7nVN/iiGntU1xwhqDBSGIk0mroCrLbp\nways6X7JJT9qcDFHMrR69kyG20y/7LKbvU+fmfudCzvyyOneseMFnp0drNex43mekTHCzQrCcLrU\nzaZ4nz6Xe3r6BZ6REayXmjrT+/S52TMy9o3qgjL8vW62//mzjIwR3qNHQbVpygNHcUccEYRWcuQW\nTFEeeM7Nff8RXnK9zyscqe+5uvqO6hpa8n8wI0OFkcJIpNnVNT1Y13oNqUCMTitG16utAnHv3r21\n7i85Fdmt24FFH507T/
Nx4+Z6t275VdOUwcgtWO7YMRlu53lKSjIEDxy5RUd4Zkd7hw5TDwi+1NTz\nvGvXET5wYBCkhx8+yw87LBitdes2wjt1CkIrNTXYR0ZGMKobPHj/UV10WrKmUV0yBKuv98ADj9Qr\ntOp7fu/ZZ397SGGkh+vVQHftFml+1e/Y/tBDS7jyyvMPuIN79fWWLVvBiy/uZeBAo6QkwcKFZ5KX\nN6rO/V133a3cf/86Dj88g7Vr1zJt2je4/faZB+wvedf3oUPT91v+8MO9TJnSl6ef3kpp6S307DmR\nzz7rwZAhqbz//no6d+7Ppk1307dvPiNGdOWVV7ayadOtdO8+kZ07ezB4cBoffJDgyCP78s47W9mz\n5yjgHoLLMO8iM3MSO3dmkEj8kt698zn11K4UFW1l8+bgAZRmp+B+FzAJ2PcAyszMzXTu/A0++WQ4\nKSn3AcNJJBZiNgn3YL3MzAJuu20Effs6kycXVj2AcuLEn/OFL1TQseMprF17E927X0BFxRo6dBjB\njh0nkpoa7G/v3kUMHnw9u3a9xubNX2Xs2CzeeONtKiuP4eOP5+J60mvjURiJtB51PXakIdtE35s4\n8ccAPPbY7fstVw+taKBFg66kJMGUKX34xS82HrBedB8DBsD775dWhVgy3IYOTaOkxKv2kVwvM7M/\nn356N9nZE9mxowdDhqTxz38m+Nd/7cvrr29l27a5ZGZeilkvPvvs7v1CsLh4LYlEBVlZp7B165GY\n3Yf7cGAh0XBLT8/ny1/uyrp1Wykvv4WMjEtx78WuXV9jXyDOIy3tP0hNfR34F3buvE9h1JgURiLy\neWoLtPqO8Krvo67R2qGO6qLrLV26nKeeWsErr3Rgy5a5ZGVdSlpaLz79tPYQjO5vyJC0/YJz4MAC\nLrigKw89VEZFxT0Ko8akMBKR5lbfEV5DRnXV11uy5GW++93ldYZW9RCM7q/66O8730ln3Lhvc/75\noxVGjUlhJCJtWX1Dqz7bR7c5lCe9KoxqoDASETl4reqx4yIiItUpjEREJHYKIxERiZ3CSEREYqcw\nEhGR2CmMREQkdgojERGJncJIRERipzASEZHYKYxERCR2CiMREYmdwkhERGKnMBIRkdgpjEREJHYK\nIxERiZ3CSEREYtcuw8jMRpvZu2ZWbGYz4+6PiEh71+7CyMxSgfuB0UAOcLGZDY+3Vy1XUVFR3F1o\nMXQs9tGx2EfHonG0uzACTgTWuvsH7r4beAo4J+Y+tVj6H20fHYt9dCz20bFoHO0xjA4DSiKv14Vt\nIiISk/YYRh53B0REZH/m3r7+NpvZ14HZ7j46fF0AJNz9tsg67eugiIg0Ene3hmzXHsMoDfhf4HRg\nPfAmcLG7r461YyIi7Vha3B1obu6+x8ymAsuBVOBhBZGISLza3chIRERanvZYwFCr9nwxrJkNNLPf\nm9nfzewdM5sWtmeb2QozW2NmhWbWPe6+NhczSzWzv5rZi+HrdnkszKy7mS0xs9VmtsrMTmrHx6Ig\n/H/kbTN7wswy2suxMLNfmVmZmb0daav1u4fHqjj8mzry8/avMArpYlh2Az9096OArwNXh98/H1jh\n7kcAvwtftxfTgVXsq8Bsr8fiHuAldx8OfBV4l3Z4LMxsCHAlcJy7H00wzT+e9nMsFhL8fYyq8bub\nWQ5wEcHf0tHAfDOrM28URvu064th3X2Du78VLm8DVhNcfzUWWByuthg4N54eNi8zGwCcBfwSSFYH\ntbtjYWbdgG+6+68gOOfq7ltoh8cC2Erwj7ZOYSFUJ4IiqHZxLNz9NeDTas21ffdzgCfdfbe7fwCs\nJfgbWyuF0T66GDYU/gvwWOCPQB93LwvfKgP6xNSt5jYPuA5IRNra47EYCnxsZgvN7C9m9pCZdaYd\nHgt33wzcCXxIEELl7r6CdngsImr77v0J/oYmfe7fU4XRPqrkAMysC7AUmO7uFdH3PKh2
afPHyczG\nABvd/a/sGxXtp70cC4KK2+OA+e5+HLCdatNQ7eVYmNnhwDXAEII/tl3MbGJ0nfZyLGpSj+9e53FR\nGO1TCgyMvB7I/sne5plZOkEQPeruz4XNZWbWN3y/H7Axrv41o28AY83sfeBJ4Ftm9ijt81isA9a5\n+5/C10sIwmlDOzwWXwP+4O6b3H0PsAw4mfZ5LJJq+3+i+t/TAWFbrRRG+/w3MMzMhphZB4KTby/E\n3KdmY2YGPAyscve7I2+9AEwKlycBz1Xftq1x91nuPtDdhxKcoH7V3S+lfR6LDUCJmR0RNn0b+Dvw\nIu3sWBAUbnzdzDLD/1++TVDg0h6PRVJt/0+8AIw3sw5mNhQYRnCDgVrpOqMIMzsTuJt9F8PeEnOX\nmo2ZnQqsBP6HfcPpAoL/gJ4BBgEfABe6e3kcfYyDmY0ArnX3sWaWTTs8FmZ2DEEhRwfgPeAKgv9H\n2uOx+DHBH90E8Bfge0AW7eBYmNmTwAigF8H5oRuA56nlu5vZLOC7wB6Caf/lde5fYSQiInHTNJ2I\niMROYSQiIrFTGImISOwURiIiEjuFkYiIxE5hJCIisVMYibRwZrYtsnyWmf2vmQ2saxuR1qbdPelV\npBVyADM7neBxDiPdvaTuTURaF4WRSCtgZv8K/AI4093fj7s/Io1Nd2AQaeHMbDfBs3RGuPs7cfdH\npCnonJFIy7cL+L8E90ETaZMURiItXwK4EDjRzAri7oxIU9A5I5FWwN0rzexs4DUzK0s+BlykrVAY\nibR8DuDun5rZaGClmW109/+KuV8ijUYFDCIiEjudMxIRkdgpjEREJHYKIxERiZ3CSEREYqcwEhGR\n2CmMREQkdgojERGJncJIRERi9/8B9Fbs3RehDOMAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7ff069d2c1d0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
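    "for k in ks:\n",
    "    # MiniBatchKMeans is the faster mini-batch variant of the commented-out KMeans call\n",
    "    #kmeans = KMeans(k).fit(X)\n",
    "    kmeans = MiniBatchKMeans(n_clusters=k, init='k-means++', n_init=1, init_size=1000, batch_size=1000).fit(X)\n",
    "    inertias[k - 1] = kmeans.inertia_\n",
    "    # first difference of the inertia curve\n",
    "    if k > 1:\n",
    "        diff[k - 1] = inertias[k - 1] - inertias[k - 2]\n",
    "    # second difference\n",
    "    if k > 2:\n",
    "        diff2[k - 1] = diff[k - 1] - diff[k - 2]\n",
    "    # third difference\n",
    "    if k > 3:\n",
    "        diff3[k - 1] = diff2[k - 1] - diff2[k - 2]\n",
    "\n",
    "# diff3 is only filled from index 3 onward, so search there and shift the result back;\n",
    "# elbow is an index into ks and inertias, and ks[elbow] is the corresponding K\n",
    "elbow = np.argmin(diff3[3:]) + 3\n",
    "print(elbow)\n",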
    "\n",
    "plt.plot(ks, inertias, \"b*-\")\n",
    "plt.plot(ks[elbow], inertias[elbow], marker='o', markersize=12,\n",
    "         markeredgewidth=2, markeredgecolor='r', markerfacecolor=None)\n",
    "plt.ylabel(\"Inertia\")\n",
    "plt.xlabel(\"K\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We plotted the inertia for each value of K from 1 to 100. Taking the minimum of the third difference of the inertia curve as the elbow point, the elbow falls at K=6 or 7 and is marked in red on the plot.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pandas.tools.plotting import scatter_matrix\n",
    "X = np.loadtxt(os.path.join(MODEL_DIR, \"train_jobs_lsi_coords.csv\"), delimiter=\"\\t\")\n",
    "df = pd.DataFrame(X, columns=range(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "NUM_TOPICS = 5\n",
    "\n",
    "X = np.loadtxt(os.path.join(MODEL_DIR, \"train_jobs_lsi_coords.csv\"), delimiter=\"\\t\")\n",
    "kmeans = MiniBatchKMeans(n_clusters=NUM_TOPICS, init='k-means++', n_init=1, init_size=1000, batch_size=1000).fit(X)\n",
    "y = kmeans.labels_\n",
    "\n",
    "colors = [ \"peru\", \"dodgerblue\", \"brown\", \"darkslategray\", \"lightsalmon\", \"orange\", \"springgreen\", \"orangered\", \"yellow\", \"firebrick\" ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({0: 3994, 1: 107, 2: 1968, 3: 197, 4: 3734})"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff05eefa690>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0603e6350>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05e32b350>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff049a53090>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff049a8ed90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff049a16810>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05e24fd50>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05e1c7e90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05e0a3e90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05dfa1c90>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff05def9650>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05ddc6890>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05dd4b5d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05cc8ef10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05cc21090>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05cb7ab10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05cafad50>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05cba8f10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05ca6f350>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c9742d0>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c957890>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c89b5d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c801050>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c777ed0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c6fdc10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c6eb850>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c6707d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c5d7510>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c499550>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c545cd0>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c37fd10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c304c90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c273290>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c26af90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c1c6a10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c1568d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c0d9610>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05ef84f10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff060b6f310>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff061a36590>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff062edec10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05c06ba90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057fa5550>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057f224d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057e87210>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057e0a250>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057eb4090>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057d71a10>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057cf7990>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057c59f50>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff057bdec90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057b45710>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057ac85d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057a4d310>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057a3d050>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0579b1c50>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057919ad0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05789c810>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05793dfd0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057790410>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff057715050>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0576f8690>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05767b3d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0575cafd0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057557e90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0574dbad0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05744a910>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0573d0550>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0573b63d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05733a110>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff057359d90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05721fcd0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0571a5910>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff057109f50>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05708dc90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0570668d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056ff6790>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056f793d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056ee8210>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056e61e10>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff056dc6c90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056d499d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056deb7d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056cbd5d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056bc2210>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056ba5850>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056b29590>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056a8f1d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056a13090>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff05698bc90>],\n",
       "       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff05697bad0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056880710>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056865590>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0567e72d0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056807110>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0566cde90>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056652ad0>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0565c6150>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff0565bbe50>,\n",
       "        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff056513a90>]], dtype=object)"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {