A Support Vector Machine for the classic Iris Dataset

A Support Vector Machine (SVM) is a classifier that aims to find an optimal hyperplane separating a number of categories. That separation can then be used for predictions. SVMs have been widely used in the biological sciences, so it makes sense to give one a run over the Iris dataset built into the Seaborn library, a famous dataset that dates back to 1936.

Using the scikit-learn train_test_split utility, the data is prepared for the model. The scikit-learn SVC classifier is the standard approach and, with this dataset, yields an F1 score of 0.98. A great result, but in order to be competitive on a site like Kaggle and set yourself apart from the crowd, it is well worth tweaking the model to see what may be gained.

For this we can use a grid search, which allows multiple parameters to be tested. The parameters are specified in GridSearchCV through the param_grid argument, and through a process of iteration and evaluation the best parameters can be found. Subsequent evaluation of the tuned model yields full scores of 1.0...and the need for a more complicated dataset.

                
                    import seaborn as sns

                    from sklearn.model_selection import train_test_split, GridSearchCV
                    from sklearn.svm import SVC
                    from sklearn.metrics import classification_report, confusion_matrix


                    iris = sns.load_dataset('iris')

                    X = iris.drop('species', axis=1)
                    Y = iris['species']

                    # hold out 30% of the rows for testing
                    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

                    # fit a baseline SVM with default hyperparameters
                    model = SVC()
                    model.fit(X_train, Y_train)
                    predictions = model.predict(X_test)
                    print(confusion_matrix(Y_test, predictions))
                    print(classification_report(Y_test, predictions))

                    # search over the regularisation strength C and the RBF kernel width gamma
                    param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
                    grid = GridSearchCV(SVC(), param_grid, verbose=3)
                    grid.fit(X_train, Y_train)
                    print(grid.best_params_)
                    print(grid.best_estimator_)

                    grid_predictions = grid.predict(X_test)
                    print(confusion_matrix(Y_test, grid_predictions))
                    print(classification_report(Y_test, grid_predictions))
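The grid above only varies C and gamma. As a sketch of how the same search can be widened, the snippet below also lets GridSearchCV choose between kernels. The kernel choices and grid values here are illustrative assumptions, not part of the original tuning run, and scikit-learn's bundled copy of the Iris data is used so the snippet stands alone.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# scikit-learn's bundled Iris data (same dataset the article loads via Seaborn)
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)

# Illustrative wider grid: vary the kernel alongside C and gamma.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear'],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)

print(grid.best_params_)  # winning combination across all three parameters
print(grid.best_score_)   # mean cross-validated accuracy of that combination
```

Note that `best_score_` is the mean accuracy over the internal cross-validation folds, so it gives a slightly more robust view of the tuned model than a single held-out test split.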