Python has become the most popular data science and machine learning programming language. But in order to obtain effective data and results, it’s important that you have a basic understanding of how it works with machine learning.
In this introductory tutorial, you’ll learn the basics of Python for machine learning, including different model types and the steps to take to ensure you obtain quality data, using a sample machine learning problem. In addition, you’ll get to know some of the most popular libraries and tools for machine learning.
Machine Learning 101
Machine learning (ML) is a form of artificial intelligence (AI) that teaches computers to make predictions and recommendations and solve problems based on data. Its problem-solving capabilities make it a useful tool in industries such as financial services, healthcare, marketing and sales, and education among others.
Types of machine learning
There are three main types of machine learning: supervised, unsupervised, and reinforcement.
In supervised learning, the computer is given a set of training data that includes both the input data (the features we predict from) and the output data (the labels or values we want to predict). The computer then learns a model that maps input to output data to make predictions on new, unseen data.
In unsupervised learning, the computer is only given the input data. The computer then learns to find patterns and relationships in the data and applies this to things like clustering or dimensionality reduction.
You can use many different algorithms for machine learning. Some popular examples include:
- Linear regression
- Logistic regression
- Decision trees
- Random forests
- Support vector machines
- Naive Bayes
- Neural networks
The choice of algorithm will depend on the problem you are trying to solve and the available data.
Reinforcement learning is a process where the computer learns by trial and error. The computer is given a set of rules (the environment) and must learn how to maximize its reward (the goal). This can be used for things like playing games or controlling robots.
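As a rough illustration of trial-and-error learning, the sketch below runs an epsilon-greedy agent on a made-up three-armed slot machine. The payout probabilities, the epsilon value, and the function name run_bandit are all assumptions chosen for this example and do not come from any library.

```python
import random

def run_bandit(payout_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical multi-armed bandit."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)       # how often each arm was pulled
    estimates = [0.0] * len(payout_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(payout_probs))
        else:
            # Exploit: pick the arm that currently looks best.
            arm = max(range(len(payout_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 0.8])
print(counts, [round(e, 2) for e in estimates])
```

The agent is never told which arm pays best. It only sees rewards, and its estimates improve as it tries each arm more often.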
The steps of a machine learning project
The first step in any machine learning project is to import the data. This data can come from various sources, including files on your computer, databases, or web APIs. The format of the data will also vary depending on the source.
For example, you may have a CSV file containing tabular data or an image file containing raw pixel data. No matter the source or format, you must load the data into memory before doing anything with it. This can be accomplished using a library like NumPy, scikit-learn, or Pandas.
Once the data is loaded, you will usually want to scrutinize it to ensure everything looks as expected. This step is critical, especially when working with cluttered or unstructured data.
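A minimal sketch of loading and inspecting tabular data with Pandas might look like this. The CSV content and column names are invented for the example; in practice you would pass a file path to read_csv instead of an in-memory string.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "length,width,label\n5.1,3.5,a\n4.9,3.0,b\n"

df = pd.read_csv(io.StringIO(csv_text))

# Quick checks that the data looks as expected.
print(df.shape)    # number of rows and columns
print(df.dtypes)   # type of each column
print(df.head())   # first few rows
```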
Once you have imported the data, the next step is to clean it up. This can involve various tasks, such as removing invalid, missing, or duplicated data; converting data into the correct format; and normalizing data. This step is crucial because it can make a big difference in the performance of your machine learning model.
For example, if you are working with tabular data, you will want to ensure all of the columns are in the proper format (e.g., numeric values instead of strings). You will also want to check missing values and decide how to handle them (e.g., imputing the mean or median value).
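The cleaning steps described above can be sketched with Pandas as follows. The table is made up for illustration, and filling gaps with the median is just one of the options mentioned.

```python
import pandas as pd

# Hypothetical raw table: numbers stored as strings, one missing value,
# and one duplicated row.
raw = pd.DataFrame({
    "height_cm": ["170", "165", None, "165"],
    "city": ["Oslo", "Lima", "Pune", "Lima"],
})

clean = raw.drop_duplicates().copy()                       # remove duplicated rows
clean["height_cm"] = pd.to_numeric(clean["height_cm"],     # strings -> numbers
                                   errors="coerce")
clean["height_cm"] = clean["height_cm"].fillna(            # impute the median
    clean["height_cm"].median())
print(clean)
```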
If you are working with images, you may need to resize or crop them to be the same size. You may also want to convert images from RGB to grayscale.
Splitting data into training/test sets
After cleaning the data, you’ll need to split it into training and test sets. The training set is used to train the machine learning model, while the test set evaluates the model. Keeping the two sets separate is vital because you don’t want to train the model on the test data. A model scored on data it has already seen looks better than it really is, so the evaluation would hide overfitting rather than reveal it.
A standard split for large datasets is 80/20, where 80% of the data is used for training and 20% for testing.
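To make the idea concrete, here is a small sketch of an 80/20 split in plain Python. The helper name split_rows and the seed are assumptions for this example; later in the tutorial we use scikit-learn's train_test_split for the same job.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # shuffle so the split is random
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(100))
print(len(train), len(test))  # 80 20
```

Every record ends up in exactly one of the two lists, so the model is never evaluated on a record it was trained on.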
Using the prepared data, you’ll then create the machine learning model. There are a variety of algorithms you can use for this task, but determining which to use depends on the goal you wish to achieve and the existing data.
For example, if you are working with a small dataset, you may want to use a simple algorithm like linear regression. If you are working with a large dataset, you may want to use a more complex algorithm like a neural network.
In addition, decision trees may be ideal for problems where you need to make a series of decisions. And random forests are suitable for problems where you need to make predictions based on data that is not linearly separable.
Once you have chosen an algorithm and created the model, you need to train it on the training data. You can do this by passing the training data through the model and adjusting the parameters until the model learns to make accurate predictions on the training data.
For example, if you train a model to identify images of cats, you will need to show it many photos of cats labeled as such, so it can learn to recognize them.
Training a machine learning model can be pretty complex and is often an iterative process. You may also need to try different algorithms, parameter values, or ways of preprocessing the data.
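The phrase "adjusting the parameters until the model learns" can be illustrated with a tiny gradient descent loop. This is only a sketch for a straight-line model; the function name fit_line, the learning rate, and the number of epochs are assumptions, and libraries such as scikit-learn hide this loop from you.

```python
def fit_line(xs, ys, lr=0.05, epochs=1000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Nudge the parameters in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))
```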
Evaluation and improvement
After you train the model, you’ll need to evaluate it on the test data. This step will give you a good indication of how well the model will perform on unseen data.
If the model does not perform well on the test data, you will need to go back and make changes to the model or the data. This is often the usual scenario when you first train a model—you must go back and iterate several times until you get a model that performs well.
This process is known as model tuning and is an integral part of the machine learning workflow.
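One common way to tune a model is a grid search with cross-validation, sketched below. The example assumes scikit-learn's bundled copy of the Iris dataset and a small, arbitrary grid of tree depths. The search only sees the training data, and the test set is used once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

# Try several tree depths; each candidate is scored by 5-fold
# cross-validation on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)  # accuracy of the best model on unseen data
print(search.best_params_, round(test_score, 3))
```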
Python Libraries and Tools
There are several libraries and tools that you can use to build machine learning models in Python.
One of the most popular libraries is scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN.
The library is built on NumPy, SciPy, and Matplotlib libraries. In addition, it includes many utility functions for data preprocessing, feature selection, model evaluation, and input/output.
You can use scikit-learn for various tasks. For example, you can use it to build predictive models for classification or regression problems. You can also use it for unsupervised learning tasks such as clustering or dimensionality reduction.
NumPy is another popular Python library that supports large, multi-dimensional arrays and matrices. It also includes several routines for linear algebra, Fourier transform, and random number generation.
NumPy is widely used in scientific computing and has become a standard tool for machine learning problems.
Its popularity is due to its ease of use and efficiency; NumPy code that operates on whole arrays is often much shorter and faster than an equivalent loop written in pure Python. In addition, NumPy integrates well with other Python libraries, making it easy to use in a complete machine learning stack.
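As a small illustration of that point, the sketch below computes a sum of squares twice: once with an explicit Python loop and once with a single NumPy call. The array size is arbitrary.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

# Pure-Python loop: one interpreted operation per element.
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized NumPy: the same sum of squares in one call.
vector_total = float(np.dot(values, values))

print(loop_total == vector_total)
```

Both approaches give the same result, but the vectorized version is shorter and runs in compiled code rather than in the Python interpreter.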
Pandas is a powerful Python library for data analysis and manipulation. It’s commonly used in machine learning applications for preprocessing data, as it offers a wide range of features for cleaning, transforming, and manipulating data. In addition, Pandas integrates well with other scientific Python libraries, such as NumPy and SciPy, making it a popular choice for data scientists and engineers.
At its core, Pandas is designed to make working with tabular data easier. It includes convenient functions for reading in data from various file formats; performing basic operations on data frames, such as selection, filtering, and aggregation; and visualizing data using built-in plotting functions. Pandas also offers more advanced features for dealing with complex datasets, such as join/merge operations and time series manipulation.
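The basic data frame operations mentioned above (selection, filtering, and aggregation) can be sketched as follows. The table and column names are made up for the example.

```python
import pandas as pd

# Hypothetical measurements for two groups.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 5.0, 7.0],
})

lengths = df["length"]                                     # selection: one column
long_rows = df[df["length"] > 2.0]                         # filtering: rows matching a condition
mean_by_species = df.groupby("species")["length"].mean()  # aggregation: mean per group
print(mean_by_species)
```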
Pandas is a valuable tool for any data scientist or engineer who needs to work with tabular data. It’s easy to use and efficient, and it integrates well with other Python libraries.
Matplotlib is a Python library that enables users to create two-dimensional graphics. The library is widely used in machine learning due to its ability to create visualizations of data. This is valuable for machine learning problems because it allows users to see patterns in the data that they may not be able to discern by looking at raw numbers.
Additionally, you can use Matplotlib to create simulations of machine learning algorithms. This feature can be helpful for debugging purposes or for understanding how the algorithm works.
Seaborn is a Python library for creating statistical graphics. It’s built on top of Matplotlib and integrates well with Pandas data structures.
Seaborn is often used for exploratory data analysis, as it allows you to create visualizations of your data easily. In addition, you can use Seaborn to create more sophisticated visualizations, such as heatmaps and time series plots.
Overall, Seaborn is a valuable tool for any data scientist or engineer who needs to create statistical graphics.
The Jupyter Notebook is a web-based interactive programming environment that allows users to write and execute code in various languages, including Python.
The Notebook has gained popularity in the machine learning community due to its ability to streamline the development process by allowing users to write and execute code in the same environment and inspect the data frequently.
Another reason for its popularity is its browser-based graphical user interface (GUI), which displays tables and plots inline next to the code that produced them. By comparison, a dataset with several columns is hard to visualize and inspect in a plain terminal window.
Training a Machine Learning Algorithm with Python Using the Iris Flowers Dataset
For this example, we will be using the Jupyter Notebook to train a machine learning algorithm with the classic Iris Flowers dataset.
Although the Iris Flowers dataset is small, it will allow us to demonstrate how to use Python for machine learning. This dataset has been used extensively in pattern recognition and machine learning literature. It is also relatively easy to understand, making it a good choice for our first problem.
The Iris Flowers dataset contains 150 observations of Iris flowers from three species: Iris setosa, Iris versicolor, and Iris virginica. The goal is to predict the species of a flower from four physical measurements: sepal length, sepal width, petal length, and petal width.
Installing Jupyter Notebook with Anaconda
Before getting started with training the machine learning algorithm, we will need to install Jupyter. To do so, we will use a platform known as Anaconda.
Anaconda is a free and open-source distribution of the Python programming language that includes the Jupyter Notebook. It also has various other useful libraries for data analysis, scientific computing, and machine learning.
Jupyter Notebook with Anaconda is a powerful tool for any data scientist or engineer working with Python, whether using Windows, Mac, or Linux operating systems (OSs).
Visit the Anaconda website and download the installer for your operating system. Follow the instructions to install it. You can then launch Jupyter Notebook from the Anaconda Navigator application.
Alternatively, on most OSs, you can open a terminal window, type jupyter notebook, and press Enter. This action will start the Jupyter Notebook server on your machine.
It also automatically opens the Jupyter Dashboard in a new browser window pointing to localhost at port 8888.
Creating a new notebook
Once you have Jupyter installed, you can begin training your machine learning algorithm. Start by creating a new notebook.
To create a new notebook, select the folder where you want to store the new notebook and then click the New button in the upper right corner of the interface and select Python [default]. This action will create a new notebook with Python code cells.
New notebooks are automatically opened in a new browser tab named Untitled. You can rename it by clicking Untitled. For our tutorial, rename it Iris Flower.
Importing a dataset into Jupyter
We’ll get our dataset from the Kaggle website. Head over to Kaggle.com and create a free account using a custom email, Google, or Facebook.
Next, find the Iris dataset by clicking Datasets in the left navigation pane and entering Iris Flowers in the search bar.
The CSV file contains 150 records with five attributes (sepal length, sepal width, petal length, petal width, and species) plus an Id column that numbers the rows, so there are six columns in total.
Once you’ve found the dataset, click the Download button, and ensure the download location is the same as that of your Jupyter Notebook. Unzip the file to your computer.
Next, open Jupyter Notebook and click on the Upload button in the top navigation bar. Find the dataset on your computer and click Open. You will now upload the dataset to your Jupyter Notebook environment.
We can now import the dataset into our program. We’ll use the Pandas library for this. Because this dataset is already prepared, it needs very little data preparation.
Start by typing the following code into a new cell and click run:
import pandas as pd
iris = pd.read_csv("Iris.csv")  # assumes the unzipped Kaggle file is named Iris.csv
The first line imports the Pandas library into our program under the alias pd.
The second line reads the CSV file and stores it in a variable called iris. View the dataset by typing iris and running the cell.
You should see something similar to the image below:
As you can see, each row represents one Iris flower with its attributes listed in the columns.
Apart from the Id column, the four numeric columns are the attributes or features of the Iris flower, and the last column is the class label, which corresponds to a species of Iris, such as Iris setosa or Iris virginica.
Before proceeding, we need to remove the Id column. It is only a row identifier and says nothing about the flower itself, so leaving it in could let the classification model learn from the row order rather than from the measurements. To do so, enter the following code in a new cell.
iris.drop(columns="Id", inplace=True)
Type iris once more to see the output. You will notice the Id column has been dropped.
Understanding the Data
Now that we know how to import the dataset, let’s look at some basic operations we can perform to understand the data better.
First, let’s see what data types are in our dataset. To do this, we’ll use the dtypes attribute of the dataframe object. Type the following code into a new cell and run it:
iris.dtypes
You should see something like this:
You can see that all of the columns are floats except for the Species column, which is an object. This is because Pandas typically stores columns of strings with the object dtype.
Now let’s examine some summary statistics for our data using the describe function. Type the following code into a new cell and run it:
iris.describe()
You can see that this gives us some summary statistics for each column in our dataset.
We can also use the head and tail functions to look at the first and last few rows of our dataset, respectively. Run each of the following lines in its own cell, because a notebook cell displays only the value of its last expression:
iris.head()
iris.tail()
We can see the first five rows of our dataframe correspond to the Iris setosa class, and the last five rows correspond to the Iris virginica.
Next, we can visualize the data using several methods. For this, we will need to import two libraries, Matplotlib and Seaborn.
Type the following code into a new cell:
import seaborn as sns
import matplotlib.pyplot as plt
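You will also need to set the style and color codes of Seaborn. Additionally, some Seaborn versions generate warnings that we can ignore for this tutorial. Enter the following code:
import warnings
warnings.filterwarnings("ignore")  # hide library warnings for this tutorial
sns.set(style="white", color_codes=True)  # the "white" style is an assumption; any built-in style works
For the first visualization, create a scatter plot using the Pandas plot method, which draws with Matplotlib. Enter the following code in a new cell.
iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")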
This will generate the following output:
However, to color the scatterplot by species, we will use Seaborn’s FacetGrid class. Note that Seaborn 0.9 renamed the size argument to height. Enter the following code in a new cell.
sns.FacetGrid(iris, hue="Species", height=5) \
   .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
   .add_legend()
Your output should be as follows:
As you can see, Seaborn has automatically colored our scatterplot, so we can visualize our dataset better and see differences in sepal width and length for the three different Iris species.
We can also create a boxplot using Seaborn to visualize the petal length of each species. Enter the following code in a new cell:
sns.boxplot(x="Species", y="PetalLengthCm", data=iris)
You can also extend this plot by adding a layer of individual points using Seaborn’s stripplot. Type the following code in a new cell:
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=iris)
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=iris, jitter=True, edgecolor="gray")
Another possible visualization is the kernel density estimate (KDE) plot, which shows the estimated probability density of a feature for each species. Enter the following code:
# height replaces the older size argument (Seaborn 0.9 and later)
sns.FacetGrid(iris, hue="Species", height=6) \
   .map(sns.kdeplot, "PetalLengthCm") \
   .add_legend()
A pairplot is another useful Seaborn visualization. It shows the relationship between each pair of features in our dataset. Enter the following code into a new cell:
sns.pairplot(iris, hue="Species", height=3)
The output should be as follows:
From the above, you can quickly tell the Iris setosa species is separated from the rest across all feature combinations.
Similarly, you can also create a Boxplot grid using the code:
iris.boxplot(by="Species", figsize=(12, 6))
Let’s perform one final visualization, RadViz, which projects all four features onto a 2D plane. Enter the code:
from pandas.plotting import radviz
radviz(iris, "Species")
Split the data into a test and training set
Having understood the data, you can now begin training the model. But first, we need to split our data into a training set and a test set. To do this, we will use the train_test_split function from the scikit-learn library. We will split the dataset in a 70:30 ratio; because the dataset is small, we hold out a larger share than usual for testing.
Enter the following code in a new cell:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
Next, separate the data into dependent and independent variables:
X = iris.iloc[:, :-1].values
y = iris.iloc[:, -1].values
Split into a training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
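As an optional variation, train_test_split accepts a random_state argument to make the split repeatable and a stratify argument to keep the class proportions the same in both sets. The sketch below uses stand-in arrays shaped like the Iris data rather than the real measurements, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows, 4 features, three balanced classes, like Iris.
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat(["a", "b", "c"], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,       # 30% of the rows go to the test set
    random_state=0,      # fixed seed so the split is repeatable
    stratify=y_demo,     # keep class proportions equal in both sets
)

print(X_tr.shape, X_te.shape)
labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, test_counts)))
```

With stratification, each class contributes the same number of rows to the test set, which matters on a dataset this small.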
The confusion matrix we imported is a table that is often used to evaluate the performance of a classification model. For a two-class problem, the matrix comprises four quadrants, each counting one combination of predicted and actual class. With the three Iris species, the matrix is 3×3 instead, but it is read the same way.
The first quadrant represents the true positives, or the observations correctly predicted to be positive. The second quadrant represents the false positives, which are the observations that were incorrectly predicted to be positive. The third quadrant represents the false negatives, which are the observations that were incorrectly predicted to be negative. Finally, the fourth quadrant represents the true negatives, or the observations correctly predicted to be negative.
The matrix rows represent the actual values, while the columns represent the predicted values.
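To show how such a table is built, here is a small sketch in plain Python for three classes. The function name confusion and the example labels are invented for illustration; scikit-learn's confusion_matrix does the same counting for you.

```python
def confusion(actual, predicted, labels):
    """Count outcomes: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ["setosa", "versicolor", "virginica"]
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion(actual, predicted, labels)
for row in matrix:
    print(row)
```

Correct predictions sit on the diagonal. Here, the single off-diagonal count is a versicolor flower that was predicted to be virginica.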
Train the model and check accuracy
We will train the model and check the accuracy using four different algorithms: logistic regression, random forest classifier, decision tree classifier, and multinomial naive Bayes.
To do so, we will create an instance of each classifier class, fit it to the training data, and predict the species of the test flowers. Be sure to take note of the accuracy scores.
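Logistic regression
Enter the code below in a new cell:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
classifier.fit(X_train, y_train)  # learn from the training set
y_pred = classifier.predict(X_test)  # predict species for the test set
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))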
Random forest classifier
Enter the code below in a new cell:
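from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier().fit(X_train, y_train)  # create and train the model
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))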
Decision tree classifier
Enter the code below in a new cell:
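from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)  # learn from the training set
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))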
Multinomial naive bayes
Enter the following code in a new cell:
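from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)  # learn from the training set
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))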
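Because a single random split of a small dataset can favor one algorithm by chance, a possible cross-check is to compare the same four classifiers with 5-fold cross-validation. This sketch assumes scikit-learn's bundled copy of the Iris data instead of the Kaggle CSV, and the max_iter and random_state values are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)  # scikit-learn's bundled Iris data

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "multinomial naive Bayes": MultinomialNB(),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over five different train/test partitions.
    scores[name] = cross_val_score(model, X_all, y_all, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```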
Evaluating the model
Based on the training, we can see that three of our four algorithms reach a high accuracy of 0.97 on the test set. Because the split is random, your scores may differ slightly. We can therefore choose any of these three to make predictions. For this tutorial, we have selected the decision tree.
We will give our model sample values for sepal length, sepal width, petal length, and petal width and ask it to predict which species it is.
Our sample flower has the following dimensions in centimeters (cms):
- Sepal length: 6
- Sepal width: 3
- Petal length: 4
- Petal width: 2
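Because classifier currently holds the naive Bayes model from the last cell, refit a decision tree first and then make the prediction. Enter the following code:
classifier = DecisionTreeClassifier().fit(X_train, y_train)
predictions = classifier.predict([[6, 3, 4, 2]])
Type predictions in a new cell to see the result. In our run, the output is Iris-virginica.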
Some Final Notes
As an introductory tutorial, we used the Iris Flowers dataset, which is a straightforward dataset containing only 150 records. Our test set has only 45 records (30%), so a single misclassified flower changes the accuracy by about two percentage points. This is why most of the algorithms end up with similar accuracies.
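To see why the scores cluster, this short sketch lists the only accuracy values that are possible near the top of the range, assuming the 45-record test set produced by the 70:30 split of 150 records.

```python
test_size = 45  # 30% of 150 records

# Accuracy when 0, 1, 2, or 3 test flowers are misclassified.
accuracies = [round((test_size - wrong) / test_size, 3) for wrong in range(4)]
print(accuracies)  # [1.0, 0.978, 0.956, 0.933]
```

An accuracy of about 0.97 therefore corresponds to a single misclassified flower, so several different algorithms can easily land on the same score.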
However, in a real-world situation, the dataset may have thousands or millions of records. That said, Python is well-suited for handling large datasets and can easily scale up to higher dimensions.