Tuesday 22 June 2021

Renaming pandas column without knowing column name.

#rename unknown pandas column. #rename columns I do not know the name of. #renaming pandas column

df.rename(columns={df.columns[0]:'data'},inplace=True)
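
A minimal sketch of the same idea on a made-up DataFrame (the column names below are hypothetical). It returns a new frame instead of using inplace=True, which should behave the same otherwise:

```python
import pandas as pd

# Hypothetical frame whose first column name we pretend not to know.
df = pd.DataFrame({"Unnamed: 0": [1, 2, 3], "other": ["a", "b", "c"]})

# df.columns[0] looks the name up by position, so it never has to be typed out.
renamed = df.rename(columns={df.columns[0]: "data"})

print(list(renamed.columns))
```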

Excel VBA: splitting cell values into rows

Delimit cell values into rows.

Public Sub delimitcell()

    Set targetRange = Range("A1")
    targetDelimiter = ";"
    Set newRange = targetRange.Offset(1, 0)
    
    For i = 0 To UBound(Split(targetRange, targetDelimiter))
        Set newRange = newRange.Offset(1, 0)
        newRange.Value = Split(targetRange, targetDelimiter)(i)
    Next
End Sub

Thursday 3 June 2021

A simple machine learning example with Sklearn

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

data = pd.read_csv('iphone_price.csv')

plt.scatter(data['version'], data['price'])

plt.show()

model = LinearRegression()

model.fit(data[['version']], data[['price']])

print(model.predict([[30]]))

iphone file csv: download the csv
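
If the CSV is not at hand, the same workflow can be tried on made-up numbers. This is only a sketch: the data below is invented (price = 100 * version + 200) and stands in for iphone_price.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data standing in for iphone_price.csv: price = 100 * version + 200.
data = pd.DataFrame({"version": np.arange(1, 11)})
data["price"] = 100 * data["version"] + 200

model = LinearRegression()
# X must be 2-D (hence the double brackets); y can be a 1-D Series.
model.fit(data[["version"]], data["price"])

# Predict through a DataFrame with the same column name used in fit().
new_phone = pd.DataFrame({"version": [30]})
predicted = model.predict(new_phone)[0]
print(round(predicted, 2))
```

Because the invented data is exactly linear, the prediction for version 30 should land at roughly 100 * 30 + 200.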

Wednesday 2 June 2021

Pandas Profiling


!pip install pandas-profiling

import pandas_profiling as pp

pp.ProfileReport(df)

Tuesday 1 June 2021

seaborn heatmap

seaborn heatmap matrix correlation

import seaborn as sns

import matplotlib.pyplot as plt

plt.figure(figsize=(20,8))

# taking all rows

df_small = df

correlation_mat = df_small.corr()

sns.heatmap(correlation_mat, annot = True, cmap='gist_yarg')

plt.show()

Where to get the colormap?

https://matplotlib.org/stable/tutorials/colors/colormaps.html

Finding special characters

Finding special characters using python pandas:

df.loc[df['ITEM'].str.contains(r'[?]')]

Check For a Substring in a Pandas DataFrame Column

Looking for strings to cut down your dataset for analysis and machine learning

The Pandas library is a comprehensive tool not only for crunching numbers but also for working with text data.

For many data analysis applications and machine learning exploration/pre-processing, you’ll want to either filter out or extract information from text data. To do so, Pandas offers a wide range of in-built methods that you can use to add, remove, and edit text columns in your DataFrames.

In this piece, let’s take a look specifically at searching for substrings in a DataFrame column. This may come in handy when you need to create a new category based on existing data (for example during feature engineering before training a machine learning model).

If you want to follow along, download the dataset here.

import pandas as pd

df = pd.read_csv('vgsales.csv')

Now let’s get started!

NOTE: we’ll be using a lot of loc in this piece, so if you’re unfamiliar with that method, check out the first article linked at the very bottom of this piece.

Using “contains” to Find a Substring in a Pandas DataFrame

The contains method in Pandas allows you to search a column for a specific substring. The contains method returns boolean values for the Series with True for if the original Series value contains the substring and False if not. A basic application of contains should look like Series.str.contains("substring"). However, we can immediately take this to the next level with two additions:

Using the case argument to specify whether to match on string case;
Using the returned Series of boolean values as a mask to get a subset of the DataFrame.

Applying these two should look like this:

pokemon_games = df.loc[df['Name'].str.contains("pokemon", case=False)]

#if contains special characters:

df.loc[df['ITEM'].str.contains(r'[?]')]

Using the loc method allows us to get only the values in the DataFrame that contain the string “pokemon”. We’ve simply used the contains method to acquire True and False values based on whether the “Name” column includes our substring and then returned only the True values.
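
To make the two steps visible, here is a small sketch on a hypothetical four-row frame (not the vgsales data) that prints the mask first and then the filtered rows:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of vgsales.csv.
df = pd.DataFrame(
    {"Name": ["Pokemon Red/Pokemon Blue", "Wii Sports", "pokemon pinball", "What?"]}
)

# Step 1: str.contains returns a boolean Series (the mask).
mask = df["Name"].str.contains("pokemon", case=False)
print(mask.tolist())

# Step 2: passing the mask to loc keeps only the rows marked True.
pokemon_games = df.loc[mask]
print(pokemon_games["Name"].tolist())

# "?" is a regex metacharacter, so a literal "?" goes in a character class (or is escaped).
question_rows = df.loc[df["Name"].str.contains(r"[?]")]
print(question_rows["Name"].tolist())
```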

Using regex with the “contains” method in Pandas

In addition to just matching on a regular substring, we can also use contains to match on regular expressions. We’ll use the exact same format as before, except this time let’s use a bit of regex to only find the story-based Pokemon games (i.e. excluding Pokemon Pinball and the like).

pokemon_og_games = df.loc[df['Name'].str.contains(r"pokemon \w{1,}/", case=False)]

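Above, I just used some simple regex to find strings that matched the pattern of “pokemon” + a space + “one or more word characters” + “/”. The pattern is written as a raw string (the r prefix) so that Python passes the backslash in \w to the regex engine untouched. The result of the new mask returned rows including “Pokemon Red/Pokemon Blue”, “Pokemon Gold/Pokemon Silver”, and more.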

Next, let’s do another quick example of using regex to find all Sports games with “football” or “soccer” in their names. First, we’ll use a simple conditional statement to keep only the rows with a genre of “Sports”:

sports_games = df.loc[df['Genre'] == 'Sports']

You’ll notice that above there was no real need to match on a substring or use regex, because we were simply selecting rows based on a category. However, when matching on the row name, we’ll need to be searching different types of strings for a substring, which is where regex comes in handy. To do so, we’ll do the following:

football_soccer_games = sports_games.loc[sports_games['Name'].str.contains("soccer|football", case=False)]

Now we’ve gotten a DataFrame with just the games that have a name including “soccer” or “football”. We simply made use of the “|” regex “or” operator that allows you to match on a string that contains one or another substring.

So we’ve successfully gotten a DataFrame with only names that contain either “football” or “soccer”, but we don’t actually know which of those two strings it contains. If we wanted to know which of the two it contained, we could use the findall method on the name column and assign the returned values to a new column in the DataFrame.

The findall method returns all matches of the regular expression you specify in each string of the Series you call it on. The format is largely the same as the contains method, except that findall has no case argument: to ignore case, you’ll need to import re and pass flags=re.IGNORECASE.

import re
football_soccer_games['Football/Soccer'] = football_soccer_games['Name'].str.findall('football|soccer', flags=re.IGNORECASE)

You’ll see at the end of the returned DataFrame a new column that holds, for each row, a list of the matches found, such as [“Soccer”] or [“Football”], depending on which of the two the videogame name contains. This can be helpful if you need to create new columns based on the existing columns and using values from those columns.
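
A short sketch of what findall hands back, using a few invented game names. Taking the first match with .str[0] is one possible way to get a plain label instead of a list; it is an assumption about what you might want, not part of the original walkthrough.

```python
import re
import pandas as pd

# Invented names; the third one contains both words, the last one neither.
names = pd.Series(["FIFA Soccer 13", "Madden NFL Football", "Football vs Soccer", "Tennis"])

# findall returns one list per row with every match, in order of appearance.
matches = names.str.findall("football|soccer", flags=re.IGNORECASE)
print(matches.tolist())

# Keep only the first match per row (NaN where the list is empty).
first_match = matches.str[0]
print(first_match.tolist())
```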

Finally, for a quick trick to exclude strings with just one additional operator on top of the basic contains method, let’s try to get all the football and soccer games that don’t include “FIFA” in the name.

not_fifa = football_soccer_games.loc[~football_soccer_games['Name'].str.contains('FIFA')]

As you can see, we’ve simply made use of the ~ operator, which inverts the boolean mask so that loc keeps the rows where contains returned False.
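
Here is a tiny sketch of the inversion on three invented names, printing the mask before and after applying ~:

```python
import pandas as pd

# Invented names for illustration only.
games = pd.DataFrame({"Name": ["FIFA 14", "Pro Evolution Soccer", "Madden NFL Football"]})

mask = games["Name"].str.contains("FIFA")
print(mask.tolist())     # True where the name contains "FIFA"
print((~mask).tolist())  # every value flipped

not_fifa = games.loc[~mask]
print(not_fifa["Name"].tolist())
```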

And that’s all!

Working with strings can be a little tricky, but the in-built Pandas methods are versatile and allow you to slice and dice your data in pretty much whatever way you need. The contains and findall methods allow you to do a lot, especially when you’re able to write some regular expressions to really find specific substrings.

Good luck with your strings!

How to make a piechart

import matplotlib.pyplot as plt

# Pie chart, where the slices will be ordered and plotted counter-clockwise:

labels = outcome.index

sizes = outcome

explode = (0, 0.1) # only "explode" the 2nd slice (i.e. '1')

fig, (ax1,ax2)=plt.subplots(1,2, figsize=(14,8))

ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',

shadow=True)

ax2.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',

shadow=True)

ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

ax2.axis('equal')

plt.show()

Monday 31 May 2021

finding out tvalue and pvalue

import numpy as np

from scipy import stats

a=np.random.normal(25.0, 5.0, 1000000) #mean, std deviation, number of data


b=np.random.normal(26.0, 5.0, 1000000) #mean, std deviation, number of data

stats.ttest_ind(a,b)

#a large absolute t-statistic means the difference between the two sample means is large compared with the variation inside the samples.

#the sign of the t-value only tells which group has the larger mean.

#the question is how far away the value is from zero!

#a low p-value (e.g. below 0.05) means the difference is unlikely to be due to chance, so we reject the null hypothesis that both groups have the same mean.

#Ttest_indResult(statistic=-141.0857113733187, pvalue=0.0)

#tvalue=

#pvalue=
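
A hedged sketch of how the result can be read in code. The sample size, the seed, the third sample c and the 0.05 threshold are my own assumptions for illustration; only a and b mirror the snippet above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
a = rng.normal(25.0, 5.0, 10_000)
b = rng.normal(26.0, 5.0, 10_000)  # true mean differs from a by 1
c = rng.normal(25.0, 5.0, 10_000)  # same true mean as a

different = stats.ttest_ind(a, b)
same = stats.ttest_ind(a, c)

alpha = 0.05  # conventional significance threshold (an assumption)
for label, result in [("a vs b", different), ("a vs c", same)]:
    verdict = "reject" if result.pvalue < alpha else "fail to reject"
    print(label, round(result.statistic, 2), result.pvalue, "->", verdict, "the null hypothesis")
```

For a vs b the t-statistic should be far from zero and the p-value tiny; for a vs c the t-statistic should usually sit close to zero.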

How to merge dataframes

How to merge dataframes:

import numpy as np
import pandas as pd

a = pd.DataFrame({'category': ['electronics', 'mens clothing', 'womens clothing'],
                  'Sales inunit': np.array([7, 3, 5])})

b = pd.DataFrame({'category': ['electronics', 'mens clothing', 'womens clothing'],
                  'Sales inunit': np.array([7, 11, 8])})

df3=pd.merge(a,b, on ='category', how='left')
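
Both frames carry a column with the same name that is not the merge key, so pandas adds suffixes to tell them apart. The sketch below uses slightly different, invented rows to show that and what how='left' does with a category missing from the right frame; the suffix names are my own choice.

```python
import pandas as pd

a = pd.DataFrame({"category": ["electronics", "mens clothing"], "Sales inunit": [7, 3]})
b = pd.DataFrame({"category": ["electronics", "womens clothing"], "Sales inunit": [7, 8]})

# The shared non-key column gets a suffix per side ("_x"/"_y" by default).
# how="left" keeps every row of a; categories missing from b are filled with NaN.
merged = pd.merge(a, b, on="category", how="left", suffixes=("_a", "_b"))
print(merged)
```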

Friday 28 May 2021

Machine Learning Crash Course

In this end-to-end Python machine learning tutorial, you’ll learn how to use Scikit-Learn to build and tune a supervised learning model!

We’ll be training and tuning a random forest for wine quality (as judged by wine ~~snobs~~ experts) based on traits like acidity, residual sugar, and alcohol concentration.

Before we start, we should state that this guide is meant for beginners who are interested in applied machine learning.

Our goal is to introduce you to one of the most flexible and useful libraries for machine learning in Python. We’ll skip the theory and math in this tutorial, but we’ll still recommend great resources for learning those.


Before we start…

Recommended Prerequisites

The recommended prerequisites for this guide are at least basic Python programming skills. To move quickly, we’ll assume you have this background.

Why Scikit-Learn for machine learning?

Scikit-Learn, also known as sklearn, is Python’s premier general-purpose machine learning library. While you’ll find other packages that do better at certain tasks, Scikit-Learn’s versatility makes it the best starting place for most ML problems.

It’s also a fantastic library for beginners because it offers a high-level interface for many tasks (e.g. preprocessing data, cross-validation, etc.). This allows you to better practice the entire machine learning workflow and understand the big picture.

WTF is machine learning?

Ahem… maybe this is a better place to start instead.

What this guide is not:

This is not a complete course on machine learning. Machine learning requires the practitioner to make dozens of decisions throughout the entire modeling process, and we won’t cover all of those nuances.

Instead, this is a tutorial that will take you from zero to your first Python machine learning model with as little headache as possible!

If you’re interested in mastering the theory behind machine learning, then we recommend our free guide:

How to Learn Machine Learning, The Self-Starter Way

In addition, we also won’t be covering exploratory data analysis in much detail, which is a vital part of real-world machine learning. We’ll leave that for a separate guide.

A quick tip before we begin:

This tutorial is designed to be streamlined, and it won’t cover any one topic in too much detail. It may be helpful to have the Scikit-Learn documentation open beside you as a supplemental reference.

Python Machine Learning Tutorial Contents

Here are the steps for building your first random forest model using Scikit-Learn:

Step 1: Set up your environment.

First, grab a nice glass of wine.

Drinking wine makes predicting wine easier (probably).

Next, make sure the following are installed on your computer:

Python 2.7+ or Python 3
NumPy
Pandas
Scikit-Learn (a.k.a. sklearn)

We strongly recommend installing Python through Anaconda (installation guide). It comes with all of the above packages already installed.

If you need to update any of the packages, it's as easy as typing $ conda update <package> from your command line program (Terminal in Mac).

You can confirm Scikit-Learn was installed properly:

$ python -c "import sklearn; print(sklearn.__version__)"
0.18.1

Great, now let's start a new file and name it sklearn_ml_example.py.

Step 2: Import libraries and modules.

To begin, let's import numpy, which provides support for more efficient numerical computation:

import numpy as np

Next, we'll import Pandas, a convenient library that supports dataframes. Pandas is technically optional because Scikit-Learn can handle numerical matrices directly, but it'll make our lives easier:

import pandas as pd

Now it's time to start importing functions for machine learning. The first one will be the train_test_split() function from the model_selection module. As its name implies, this module contains many utilities that will help us choose between models.

# Import sampling helper
from sklearn.model_selection import train_test_split

Next, we'll import the entire preprocessing module. This contains utilities for scaling, transforming, and wrangling data.

# Import preprocessing modules
from sklearn import preprocessing

Next, let's import the families of models we'll need... wait, did you just say "families?"

What's the difference between model "families" and actual models?

A "family" of models is a broad type of model, such as random forests, SVMs, linear regression models, etc. Within each family, you'll get an actual model after you fit and tune its parameters to the data.

*Tip: Don't worry too much about this for now... It will make more sense once we get to Step 7.

We can import the random forest family like so:

# Import random forest model
from sklearn.ensemble import RandomForestRegressor

For the scope of this tutorial, we'll only focus on training a random forest and tuning its parameters. We'll have another detailed tutorial for how to choose between model families.

For now, let's move on to importing the tools to help us perform cross-validation.

# Import cross-validation pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

Next, let's import some metrics we can use to evaluate our model performance later.

# Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

And finally, we'll import a way to persist our model for future use.

# Import module for saving scikit-learn models
import joblib

Joblib is an alternative to Python's pickle package, and we'll use it because it's more efficient for storing large numpy arrays. It is imported as a standalone package because sklearn.externals.joblib was removed in scikit-learn 0.23.

Phew! That was a lot. Don't worry, we'll cover each function in detail once we get to it. Let's first take a quick sip of wine and toast to our progress... cheers!

Step 3: Load red wine data.

Alright, now we're ready to load our data set. The Pandas library that we imported is loaded with a whole suite of helpful input/output tools.

You can read data from CSV, Excel, SQL, SAS, and many other data formats. Here's a list of all the Pandas IO tools.

The convenient tool we'll use today is the read_csv() function. Using this function, we can load any CSV file, even from a remote URL!

# Load wine data from remote URL
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)

Now let's take a look at the first 5 rows of data:

# Output the first 5 rows of data
print(data.head())
# fixed acidity;"volatile acidity";"citric acid"...
# 0   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56...
# 1   7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68...
# 2   7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0...
# 3   11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;...
# 4   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56...

Crap... that looks really messy. Upon further inspection, it looks like the CSV file is actually using semicolons to separate the data. That's annoying, but easy to fix:

# Read CSV with semicolon separator
data = pd.read_csv(dataset_url, sep=';')

print(data.head())
#    fixed acidity  volatile acidity  citric acid...
# 0            7.4              0.70         0.00...
# 1            7.8              0.88         0.00...
# 2            7.8              0.76         0.04...
# 3           11.2              0.28         0.56...
# 4            7.4              0.70         0.00...

Great, that's much nicer. Now, let's take a look at the data.

print(data.shape)
# (1599, 12)

We have 1,599 samples and 12 features, including our target feature. We can easily print some summary statistics.

# Summary statistics
print(data.describe())
#        fixed acidity  volatile acidity  citric acid...
# count    1599.000000       1599.000000  1599.000000...
# mean        8.319637          0.527821     0.270976...
# std         1.741096          0.179060     0.194801...
# min         4.600000          0.120000     0.000000...
# 25%         7.100000          0.390000     0.090000...
# 50%         7.900000          0.520000     0.260000...
# 75%         9.200000          0.640000     0.420000...
# max        15.900000          1.580000     1.000000...

Here's the list of all the features:

quality (target)
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

All of the features are numeric, which is convenient. However, they have some very different scales, so let's make a mental note to standardize the data later.

As a reminder, for this tutorial, we're cutting out a lot of exploratory data analysis we'd typically recommend.

For now, let's move on to splitting the data.

Step 4: Split data into training and test sets.

Splitting the data into training and test sets at the beginning of your modeling workflow is crucial for getting a realistic estimate of your model's performance.

First, let's separate our target (y) features from our input (X) features:

# Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)

This allows us to take advantage of Scikit-Learn's useful train_test_split function:

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

As you can see, we'll set aside 20% of the data as a test set for evaluating our model. We also set an arbitrary "random state" (a.k.a. seed) so that we can reproduce our results.

Finally, it's good practice to stratify your sample by the target variable. This will ensure your training set looks similar to your test set, making your evaluation metrics more reliable.
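
To see what stratifying does, here is a small sketch on an invented, imbalanced target (90 samples of class 0, 10 of class 1) rather than the wine data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 samples of class 0 and 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# With stratify=y, both splits keep the original 10% share of class 1.
print(y_train.mean(), y_test.mean())
```

Without stratify=y, the share of class 1 in the 20-sample test set would vary with the random seed.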

Step 5: Declare data preprocessing steps.

Remember, in Step 3, we made the mental note to standardize our features because they were on different scales.

WTF is standardization?

Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations.

Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.
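
The definition can be written out by hand with NumPy. This is a sketch on an invented 3x2 array, using the population standard deviation (NumPy's default), which is also what Scikit-Learn's scaling utilities use:

```python
import numpy as np

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

mean = X.mean(axis=0)  # per-feature mean
std = X.std(axis=0)    # per-feature (population) standard deviation

# Subtract the mean, then divide by the standard deviation, feature by feature.
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```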

First, here's some code that we won't use...

Scikit-Learn makes data preprocessing a breeze. For example, it's pretty easy to simply scale a dataset:

# Lazy way of scaling data
X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled)
# array([[ 0.51358886,  2.19680282, -0.164433  , ...,  1.08415147,
#         -0.69866131, -0.58608178],
#        [-1.73698885, -0.31792985, -0.82867679, ...,  1.46964764,
#          1.2491516 ,  2.97009781],
#        [-0.35201795,  0.46443143, -0.47100705, ..., -0.13658641,
# ...

You can confirm that the scaled dataset is indeed centered at zero, with unit variance:

print(X_train_scaled.mean(axis=0))
# [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

print(X_train_scaled.std(axis=0))
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

Great, but why did we say that we won't use this code?

The reason is that we won't be able to perform the exact same transformation on the test set.

Sure, we can still scale the test set separately, but we won't be using the same means and standard deviations as we used to transform the training set.

In other words, it wouldn't be a fair representation of how the model pipeline, including the preprocessing steps, would perform on brand new data.

Now, here's the preprocessing code we will use...

So instead of directly invoking the scale function, we'll be using a feature in Scikit-Learn called the Transformer API. The Transformer API allows you to "fit" a preprocessing step using the training data the same way you'd fit a model...

...and then use the same transformation on future data sets!

Here's what that process looks like:

Fit the transformer on the training set (saving the means and standard deviations)
Apply the transformer to the training set (scaling the training data)
Apply the transformer to the test set (using the same means and standard deviations)

This makes your final estimate of model performance more realistic, and it allows you to insert your preprocessing steps into a cross-validation pipeline (more on this in Step 7).

Here's how you do it:

# Fitting the Transformer API
scaler = preprocessing.StandardScaler().fit(X_train)

Now, the scaler object has the saved means and standard deviations for each feature in the training set.

Let's confirm that worked:

# Applying transformer to training data
X_train_scaled = scaler.transform(X_train)

print(X_train_scaled.mean(axis=0))
# [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

print(X_train_scaled.std(axis=0))
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

Note how we're taking the scaler object and using it to transform the training set. Later, we can transform the test set using the exact same means and standard deviations used to transform the training set:

# Applying transformer to test data
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(axis=0))
# [ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
#  -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]

print(X_test_scaled.std(axis=0))
# [ 1.02160495  1.00135689  0.97456598  0.91099054  0.86716698  0.94193125
#  1.03673213  1.03145119  0.95734849  0.83829505  1.0286218 ]

Notice how the scaled features in the test set are not perfectly centered at zero with unit variance! This is exactly what we'd expect, as we're transforming the test set using the means from the training set, not from the test set itself.

In practice, when we set up the cross-validation pipeline, we won't even need to manually fit the Transformer API. Instead, we'll simply declare the class object, like so:

# Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

This is exactly what it looks like: a modeling pipeline that first transforms the data using StandardScaler() and then fits a model using a random forest regressor.

Step 6: Declare hyperparameters to tune.

Now it's time to consider the hyperparameters that we'll want to tune for our model.

WTF are hyperparameters?

There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot.

Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.

Example: random forest hyperparameters.

As an example, let's take our random forest for regression:

Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are model parameters.

However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. The algorithm also cannot decide how many trees to include in the forest. These are examples of hyperparameters that the user must set.

We can list the tunable hyperparameters like so:

# List tunable hyperparameters
print(pipeline.get_params())
# ...
# 'randomforestregressor__criterion': 'mse',
# 'randomforestregressor__max_depth': None,
# 'randomforestregressor__max_features': 'auto',
# 'randomforestregressor__max_leaf_nodes': None,
# ...

You can also find a list of all the parameters on the RandomForestRegressor documentation page. Just note that when it's tuned through a pipeline, you'll need to prepend randomforestregressor__ before the parameter name, like in the code above.

Now, let's declare the hyperparameters we want to tune through cross-validation.

# Declare hyperparameters to tune
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

As you can see, the format should be a Python dictionary (data structure for key-value pairs) where keys are the hyperparameter names and values are lists of settings to try. The options for parameter values can be found on the documentation page.

Step 7: Tune model using a cross-validation pipeline.

Now we're almost ready to dive into fitting our models. But first, we need to spend some time talking about cross-validation.

This is one of the most important skills in all of machine learning because it helps you maximize model performance while reducing the chance of overfitting.

WTF is cross-validation (CV)?

Cross-validation is a process for reliably estimating the performance of a method for building a model by training and evaluating your model multiple times using the same method.

Practically, that "method" is simply a set of hyperparameters in this context.

These are the steps for CV:

Split your data into k equal parts, or "folds" (typically k=10).
Train your model on k-1 folds (e.g. the first 9 folds).
Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
Perform steps (2) and (3) k times, each time holding out a different fold.
Aggregate the performance across all k folds. This is your performance metric.

K-Fold Cross-Validation diagram, courtesy of Wikipedia

Why is cross-validation important in machine learning?

Let's say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.

How can you decide?

That's where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness.

This allows you to keep your test set "untainted" and save it for a true hold-out evaluation when you're finally ready to select a model.

For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. Then, you still have the untainted test set to make your final selection between the model families!

So WTF is a cross-validation "pipeline?"

The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with influential data from your test fold.

Here's how the CV pipeline looks after including preprocessing steps:

Split your data into k equal parts, or "folds" (typically k=10).
Preprocess k-1 training folds.
Train your model on the same k-1 folds.
Preprocess the hold-out fold using the same transformations from step (2).
Evaluate your model on the same hold-out fold.
Perform steps (2) - (5) k times, each time holding out a different fold.
Aggregate the performance across all k folds. This is your performance metric.
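
The steps above can be sketched as an explicit loop. This is only an illustration under my own assumptions: synthetic data from make_regression, a plain LinearRegression swapped in for the random forest to keep it fast, k=5 instead of 10, and mean squared error as the metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic data as a stand-in for a training set.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Split the data into k folds (k=5 here to keep the example quick).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, holdout_idx in kfold.split(X):
    # Preprocess the k-1 training folds: the scaler is fitted on them only.
    scaler = StandardScaler().fit(X[train_idx])
    # Train the model on the same k-1 folds.
    model = LinearRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    # Preprocess the hold-out fold with the training-fold statistics, then evaluate.
    y_pred = model.predict(scaler.transform(X[holdout_idx]))
    fold_scores.append(mean_squared_error(y[holdout_idx], y_pred))

# Aggregate the performance across all folds.
cv_score = np.mean(fold_scores)
print(len(fold_scores), cv_score)
```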

Fortunately, Scikit-Learn makes it stupidly simple to set this up:

# Sklearn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

# Fit and tune model
clf.fit(X_train, y_train)

Yes, it's really that easy. GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameter values.

It takes in your model (in this case, we're using a model pipeline), the hyperparameters you want to tune, and the number of folds to create.
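
To get a feel for the size of that grid, Scikit-Learn's ParameterGrid helper can enumerate the combinations without fitting anything. This is a side sketch using the same dictionary as in Step 6; ParameterGrid only expands the lists and does not check whether the values are valid for the estimator.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    "randomforestregressor__max_features": ["auto", "sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 5, 3, 1],
}

# Every combination of one max_features value with one max_depth value.
grid = list(ParameterGrid(hyperparameters))
n_folds = 10

print(len(grid))            # number of candidate settings: 3 * 4
print(len(grid) * n_folds)  # model fits during the search, before the final refit
print(grid[0])
```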

Obviously, there's a lot going on under the hood. We've included the pseudo-code above, and we'll cover writing cross-validation from scratch in a separate guide.

Now, you can see the best set of parameters found using CV:

print(clf.best_params_)
# {'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'auto'}

Interestingly, it looks like the default parameters win out for this data set.

*Tip: It turns out that in practice, random forests don't actually require a lot of tuning. They tend to work pretty well out-of-the-box with a reasonable number of trees. Even so, these same steps can be used when building any type of supervised learning model.

Step 8: Refit on the entire training set.

After you've tuned your hyperparameters appropriately using cross-validation, you can generally get a small performance improvement by refitting the model on the entire training set.

Conveniently, GridSearchCV from sklearn will automatically refit the model with the best set of hyperparameters using the entire training set.

This functionality is ON by default, but you can confirm it:

# Confirm model will be retrained
print(clf.refit)
# True

Now, you can simply use the clf object as your model when applying it to other sets of data. That's what we'll be doing in the next step.

Step 9: Evaluate model pipeline on test data.

Alright, we're in the home stretch!

This step is really straightforward once you understand that the clf object you used to tune the hyperparameters can also be used directly like a model object.

Here's how to predict a new set of data:

# Predict a new set of data
y_pred = clf.predict(X_test)

Now we can use the metrics we imported earlier to evaluate our model performance.

print(r2_score(y_test, y_pred))
# 0.45044082571584243

print(mean_squared_error(y_test, y_pred))
# 0.35461593750000003
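
For reference, both metrics are simple to compute by hand. The sketch below uses four invented values (not the wine predictions) and the standard definitions: MSE is the average squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# Mean squared error: the average squared residual.
mse = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

A lower MSE is better, while R² is better the closer it gets to 1.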

Great, so now the question is... is this performance good enough?

Well, the rule of thumb is that your very first model probably won't be the best possible model. However, we recommend a combination of three strategies to decide if you're satisfied with your model performance.

Start with the goal of the model. If the model is tied to a business problem, have you successfully solved the problem?
Look in academic literature to get a sense of the current performance benchmarks for specific types of data.
Try to find low-hanging fruit in terms of ways to improve your model.

There are various ways to improve a model. We'll have more guides that go into detail about how to improve model performance, but here are a few quick things to try:

Try other regression model families (e.g. regularized regression, boosted trees, etc.).
Collect more data if it's cheap to do so.
Engineer smarter features after spending more time on exploratory analysis.
Speak to a domain expert to get more context (...this is a good excuse to go wine tasting!).

As a final note, when you try other families of models, we recommend using the same training and test set as you used to fit the random forest model. That's the best way to get a true apples-to-apples comparison between your models.

Step 10: Save model for future use.

Great job completing this tutorial!

You've done the hard part, and deserve another glass of wine. Maybe this time you can use your shiny new predictive model to select the bottle.

But before you go, let's save your hard work so you can use the model in the future. It's really easy to do so:

joblib.dump(clf, 'rf_regressor.pkl')

And that's it. When you want to load the model again, simply use this function:

clf2 = joblib.load('rf_regressor.pkl')
 
# Predict data set using loaded model
clf2.predict(X_test)
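
Here is a small self-contained round trip you can run to convince yourself that the loaded model behaves like the original. It is a sketch under a few assumptions: it uses synthetic data and a plain RandomForestRegressor rather than the tuned clf, writes to a temporary directory, and imports joblib as a standalone package (the sklearn.externals.joblib import path has been removed from recent scikit-learn releases).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data and model (assumption: not the tuned wine model)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary folder, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rf_regressor.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
round_trip_ok = np.array_equal(model.predict(X), restored.predict(X))
print(round_trip_ok)
```

Keep in mind that a .pkl file is best loaded with the same scikit-learn version that saved it, and that you should only load pickle files you trust.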

Congratulations, you've reached the end of this tutorial!

We've just completed a whirlwind tour of Scikit-Learn's core functionality, but we've only really scratched the surface. Hopefully you've gained some guideposts to further explore all that sklearn has to offer.

For continued learning, we recommend studying other examples in sklearn.

The complete code, from start to finish.

Here's all the code in one place, in a single script.

# 2. Import libraries and modules
import numpy as np
import pandas as pd
 
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # standalone package; sklearn.externals.joblib was removed from recent scikit-learn
 
# 3. Load red wine data.
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
 
# 4. Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
 
# 5. Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(), 
                         RandomForestRegressor(n_estimators=100))
 
# 6. Declare hyperparameters to tune
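# 1.0 means "use all features", which is what 'auto' meant for regressors
# ('auto' is no longer accepted by recent scikit-learn releases)
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}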
 
# 7. Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
 
clf.fit(X_train, y_train)
 
# 8. Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)
 
# 9. Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))
 
# 10. Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')

Python for Data Analysis
