Analysis of a multiclass image dataset using scikit-learn

Project link: https://github.com/HL-Boisvert/Data_Mining_Portfolio

Dataset link: https://www.kaggle.com/andrewmvd/animal-faces

This dataset consists of 16,130 images of 512×512 pixels in 3 classes: cat, dog and wild animal.

Several interesting conclusions were drawn from this challenging dataset:

  • Using Pearson correlation, it is possible to determine which pixels have the best correlation score and thus optimize the training.
  • The classes were found not to be linearly separable.
  • Classification using k-means clustering performs very poorly due to the properties of the dataset (35% accuracy on the test set).
  • Classification with random forests works well, as does the multi-layer perceptron (MLP) classifier (75% and 80% accuracy on the test set, respectively).
  • However, classification with convolutional neural networks gives the best results, reaching 95% accuracy with optimal hyperparameters.
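
The Pearson-based pixel selection mentioned in the list can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the helper `pearson_per_pixel`, the binary (one-vs-rest style) label, the array sizes and the choice of keeping 8 pixels are all assumptions made for the example.

```python
import numpy as np


def pearson_per_pixel(X, y):
    """Pearson correlation between each pixel column of X and the label vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant pixels have zero variance; report a correlation of 0 for them.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)


# Synthetic stand-in for flattened images (assumption: 300 samples, 64 pixels).
rng = np.random.default_rng(0)
n_samples, n_pixels = 300, 64
y = rng.integers(0, 2, size=n_samples).astype(float)  # hypothetical binary label
X = rng.normal(size=(n_samples, n_pixels))
X[:, 10] += 3.0 * y   # make pixel 10 informative about the label
X[:, 5] = 1.0         # a constant pixel, e.g. a uniform border

scores = pearson_per_pixel(X, y)

# Keep the pixels with the largest absolute correlation (8 is arbitrary here).
top_k = np.argsort(-np.abs(scores))[:8]
X_reduced = X[:, top_k]
print("selected pixel indices:", top_k)
```

Training on `X_reduced` instead of `X` shrinks the input dimension. With three classes, one option would be to compute one such score per class against a one-vs-rest label and combine them.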
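
Since k-means is unsupervised, using it as a classifier needs a step that maps clusters to classes. One common approach, shown here as a sketch, assigns each cluster the majority label of its training members. The toy data from `make_blobs` is deliberately easy to separate, so the accuracy it prints is not comparable with the 35% reported above. The centers and split sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Toy 3-class data (assumption: well-separated blobs, unlike real images).
X, y = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

n_classes = 3
kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(X_train)

# Map each cluster to the most frequent training label among its members.
cluster_to_class = np.zeros(n_classes, dtype=int)
for c in range(n_classes):
    members = y_train[kmeans.labels_ == c]
    cluster_to_class[c] = np.bincount(members, minlength=n_classes).argmax()

# Predict a class for each test sample via its nearest cluster center.
y_pred = cluster_to_class[kmeans.predict(X_test)]
accuracy = (y_pred == y_test).mean()
print(f"k-means test accuracy: {accuracy:.3f}")
```

When the clusters found by k-means do not line up with the classes, as is plausible for raw pixel values, this mapping cannot do much better than chance.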
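
A comparison between a random forest and an MLP classifier in scikit-learn could look like the sketch below. It uses the small `load_digits` dataset bundled with scikit-learn as a stand-in for the animal faces images. The hyperparameters are illustrative assumptions, not the ones used in the project, so the accuracies will differ from those reported above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 8x8 digit images, already flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # The MLP usually trains better on standardized inputs.
    "mlp": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    ),
}

# Fit each model on the training split and score it on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Both models take flattened pixel vectors as input, so for 512×512 images some form of downscaling or pixel selection would likely be needed first.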
Henri-Louis Boisvert
MSc student in AI