Analysis of a multiclass image dataset using Scikit-learn

Last updated on Nov 26, 2021

Project link: https://github.com/HL-Boisvert/Data_Mining_Portfolio

Dataset link: https://www.kaggle.com/andrewmvd/animal-faces

This dataset consists of 16,130 images of 512×512 pixels in 3 classes: cat, dog, and wild animal.

Several interesting conclusions were drawn from this challenging dataset:

Using Pearson correlation, it is possible to determine which pixels have the highest correlation scores and thus optimize the training:

[Image: Screenshot 2021-11-26 at 19.18.54]
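The pixel ranking described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the images have been flattened into a matrix with one row per image and one column per pixel, and that each pixel is correlated against an integer class label. The function name `pearson_pixel_scores` and the synthetic data are hypothetical. Note that correlating against integer labels treats the classes as ordered, which is a simplification.

```python
import numpy as np


def pearson_pixel_scores(X, y):
    """Return the Pearson correlation between each pixel column of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # center each pixel column
    yc = y - y.mean()        # center the labels
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # A constant pixel has zero variance and a zero numerator,
    # so dividing by 1 there yields a score of 0 instead of NaN.
    return (Xc.T @ yc) / np.where(denom > 0, denom, 1.0)


# Synthetic stand-in (assumption): 300 flattened 8x8 "images", labels 0, 1, 2.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 64))
X[:, 5] += 2.0 * y   # pixel 5 rises with the label
X[:, 20] -= 2.0 * y  # pixel 20 falls with the label

scores = pearson_pixel_scores(X, y)

# Keep the pixels with the largest absolute correlation.
top_pixels = np.argsort(-np.abs(scores))[:2]
X_reduced = X[:, top_pixels]
```

Training on `X_reduced` instead of `X` reduces the number of input features, which is one way such a ranking could speed up training.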

The classes were determined not to be linearly separable.
Classification using k-means clustering performs poorly on this dataset, reaching only 35% accuracy on the test set, barely above the roughly one-in-three rate expected from guessing among three classes:

[Image: Screenshot 2021-11-26 at 19.18.46]
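Since k-means is unsupervised, using it as a classifier requires mapping clusters to classes. One common way to do this, shown here as a sketch under assumptions rather than as the project's method, is to assign each cluster the most frequent training label among its members. The data below is a synthetic stand-in generated with `make_classification`, so the accuracy it produces says nothing about the animal faces dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 3 classes, 64 features per sample.
X, y = make_classification(
    n_samples=600,
    n_features=64,
    n_informative=10,
    n_classes=3,
    n_clusters_per_class=2,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit one cluster per class, ignoring the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Map each cluster to the most frequent training label among its members.
cluster_to_label = {}
for cluster in range(kmeans.n_clusters):
    members = y_train[kmeans.labels_ == cluster]
    cluster_to_label[cluster] = int(np.bincount(members, minlength=3).argmax())

# Classify test samples through their nearest cluster centre.
y_pred = np.array([cluster_to_label[c] for c in kmeans.predict(X_test)])
kmeans_accuracy = accuracy_score(y_test, y_pred)
print(f"k-means test accuracy: {kmeans_accuracy:.3f}")
```

When the classes do not form compact, well-separated groups in pixel space, the clusters found this way need not line up with the classes, which is consistent with the low accuracy reported above.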

Random forest and multilayer perceptron (MLP) classifiers perform much better, reaching 75% and 80% accuracy on the test set, respectively.
However, convolutional neural networks give the best results, reaching 95% accuracy with well-tuned hyperparameters.
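The random forest and MLP comparison can be sketched with Scikit-learn as below. This is an illustration under assumptions, not the project's code: it uses the small 8×8 digits dataset bundled with Scikit-learn, restricted to 3 classes, as a stand-in for the flattened animal images, and the hyperparameters are arbitrary choices. The accuracies it prints are therefore not comparable to the figures reported above. Scikit-learn does not provide convolutional networks, so the CNN is not covered here.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): flattened 8x8 digit images, classes 0, 1, 2.
X, y = load_digits(n_class=3, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # The MLP is sensitive to feature scale, so standardize the pixels first.
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    ),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Scoring both models on the same held-out split keeps the comparison fair; for a tuned comparison, the hyperparameters would normally be chosen by cross-validation on the training split only.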
