Using GridSearchCV to Find the Optimal Parameters for a Scikit-Learn Classifier

Muhammad Arslan, January 4, 2017


Sometimes a model's accuracy falls well short of the target. The cause is not always a poor dataset or weak preprocessing; the choice of classifier parameters can be to blame as well. In Scikit-Learn, you can use GridSearchCV to search for the best parameters for the classifier you want to use. The search is brute force: it tries every combination of the candidate values you supply and reports which combination gives the best accuracy.

For the details, let's walk through the following tutorial :D.

Preparation

The computer specifications needed for this tutorial are:
  • 4 GB of RAM or more
  • An Intel Core i3 with four cores
  • 4 GB of swap (if you use Linux)
As for software, you need the following:
  • Python
  • Scikit-Learn
  • SciPy
  • NumPy
  • The 20 newsgroups dataset

Preparing the example classifier code

First, create a file named gridsearchcv-demo.py, then put the following code in it:
# Written for Python 2 and a scikit-learn release older than 0.20,
# which still ships the sklearn.grid_search module.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

import json
import datetime

# prepare the dataset
dataset_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# set up the classifier
clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])

# candidate values; the 'step__parameter' prefix selects the pipeline step
params = {
    'vect__max_df': (0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    # 'vect__ngram_range': ((1, 1), (1, 2)),
    #'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    # 'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

grid = GridSearchCV(
    clf,
    params,
    n_jobs=4,
    cv=10,
    verbose=4,
)

# run the search; fit() returns the fitted GridSearchCV object
clf = grid.fit(dataset_train.data, dataset_train.target)

print "\nBest estmator:"
print
print clf.best_estimator_

print "\nGrid score:"
print
for params, mean_score, scores in clf.grid_scores_:
    print "%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params)
print

First we prepare the training data, taken from the 20 newsgroups dataset. Next we build a pipeline of CountVectorizer, TfidfTransformer, and SGDClassifier, all with their default settings. We then define the parameter combinations to test on the pipeline; in this tutorial we examine max_df, norm, and alpha. Finally we create a GridSearchCV instance that takes the classifier, the parameter grid to search, n_jobs=4 (four parallel jobs), cv=10 (10-fold cross-validation), and verbose=4 for detailed console output.
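To make the "every combination" idea concrete, here is a minimal sketch that uses only the standard library to enumerate the grid defined above. It does not train anything; it only lists the candidates and counts how many model fits the search implies. The variable names (candidates, total_fits) are made up for this illustration.

```python
from itertools import product

# The three parameters left uncommented in the tutorial's params dict.
params = {
    "vect__max_df": (0.75, 1.0),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.00001, 0.000001),
}
cv_folds = 10  # matches cv=10 in the tutorial

# One candidate = one value picked for each parameter.
names = sorted(params)
candidates = [
    dict(zip(names, values))
    for values in product(*(params[name] for name in names))
]

# Each candidate is trained and scored once per cross-validation fold.
total_fits = len(candidates) * cv_folds
print("%d candidates, %d fits" % (len(candidates), total_fits))
for candidate in candidates:
    print(candidate)
```

The first printed line reads "8 candidates, 80 fits", which is the same count GridSearchCV announces at the start of its run.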

After that we pass the dataset to GridSearchCV through fit(), and once the parameter search finishes the script prints a report.
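The script above targets Python 2 and an older Scikit-Learn. The sklearn.grid_search module and the grid_scores_ attribute were removed in Scikit-Learn 0.20, so on a current installation the same search would look roughly like the sketch below. To keep it quick and runnable offline, it swaps the 20 newsgroups data for a tiny made-up corpus and uses cv=2 and n_jobs=1; those three choices are assumptions for the illustration, not part of the original tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus (assumption); the tutorial uses
# fetch_20newsgroups(subset='train') instead.
docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "astronauts repaired the space station",
    "the telescope observed a distant galaxy",
    "the team won the hockey game",
    "the goalie stopped every shot tonight",
    "players skated hard during the playoff",
    "the coach praised the hockey defense",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=42)),
])

params = {
    "vect__max_df": (0.75, 1.0),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.00001, 0.000001),
}

# cv=2 only because the toy corpus is tiny; the tutorial uses cv=10.
grid = GridSearchCV(pipeline, params, cv=2, n_jobs=1)
grid.fit(docs, labels)

print("Best parameters:", grid.best_params_)

# cv_results_ replaces the old grid_scores_ attribute.
results = grid.cv_results_
for mean, std, candidate in zip(results["mean_test_score"],
                                results["std_test_score"],
                                results["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std, candidate))
```

The scores from eight toy sentences mean nothing; the point is only the shape of the newer API. Swap the corpus for the real dataset and restore cv=10 and n_jobs=4 to repeat the tutorial's experiment.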

Running GridSearchCV

Now let's run the script. The run took about 6 minutes; below is the output GridSearchCV printed while searching for parameters for the classifier we are building:
$ python gridsearchcv-demo.py 

Fitting 10 folds for each of 8 candidates, totalling 80 fits
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888596 - 13.4s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.890158 - 14.0s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.886643 - 15.1s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.901408 - 16.2s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.894876 - 14.6s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.877768 - 15.1s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.889184 - 14.8s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888099 - 13.7s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.887011 - 15.1s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888691 - 13.6s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886842 - 16.0s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.891916 - 15.2s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.894552 - 13.3s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.893486 - 14.2s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.877768 - 14.4s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886042 - 15.3s
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.887411 - 13.9s
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 1.2min
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886323 - 14.3s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.882562 - 13.7s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.888691 - 14.0s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.907895 - 15.4s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.913884 - 14.3s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.920035 - 13.9s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.930458 - 14.7s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.904340 - 14.0s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.939929 - 15.4s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.921099 - 15.1s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.909414 - 16.1s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.919858 - 14.6s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.922598 - 17.3s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.911404 - 15.0s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.925308 - 14.6s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.918278 - 11.8s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.923415 - 13.1s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.936396 - 13.0s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.908769 - 12.5s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.921099 - 12.5s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.922735 - 12.7s
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.920819 - 12.3s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.918967 - 12.2s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.912281 - 14.4s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.915641 - 13.2s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.914763 - 13.3s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.929577 - 13.5s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.931095 - 12.1s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.906997 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.921099 - 12.0s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.917407 - 12.2s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.923488 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.913624 - 11.5s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.907895 - 11.7s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.921793 - 12.1s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.907733 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.926056 - 12.5s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.929329 - 12.1s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.902569 - 11.3s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.909574 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.912966 - 12.3s
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.925267 - 11.7s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.911843 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.890351 - 11.6s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.901582 - 11.6s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.903339 - 12.2s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.916373 - 11.9s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.921378 - 11.0s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.885740 - 11.2s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.898936 - 11.9s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.899645 - 11.9s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.903025 - 11.4s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.910062 - 11.8s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.895614 - 11.9s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.898946 - 11.7s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.905097 - 11.6s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.919014 - 11.3s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.917845 - 11.6s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.888397 - 11.8s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.901596 - 11.3s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.894316 - 10.5s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.903915 - 9.5s
[CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.910062 - 9.6s
[Parallel(n_jobs=4)]: Done 80 out of 80 | elapsed: 4.5min finished

Best estmator:

Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, st... penalty='l2', power_t=0.5, random_state=None, shuffle=True, verbose=0, warm_start=False))])

Grid score:

0.889 (+/-0.003) for {'vect__max_df': 0.75, 'tfidf__norm': 'l1', 'clf__alpha': 1e-05}
0.888 (+/-0.002) for {'vect__max_df': 1.0, 'tfidf__norm': 'l1', 'clf__alpha': 1e-05}
0.919 (+/-0.005) for {'vect__max_df': 0.75, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05}
0.921 (+/-0.004) for {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05}
0.919 (+/-0.004) for {'vect__max_df': 0.75, 'tfidf__norm': 'l1', 'clf__alpha': 1e-06}
0.916 (+/-0.004) for {'vect__max_df': 1.0, 'tfidf__norm': 'l1', 'clf__alpha': 1e-06}
0.903 (+/-0.005) for {'vect__max_df': 0.75, 'tfidf__norm': 'l2', 'clf__alpha': 1e-06}
0.903 (+/-0.005) for {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-06}

Once GridSearchCV has finished, we can pick the best parameters. From the results above we choose {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05}, which scored 0.921, and pass it to the pipeline we built: max_df=1.0 to CountVectorizer(), norm='l2' to TfidfTransformer, and alpha=1e-05 to SGDClassifier. In this run the mean scores ranged from 0.888 to 0.921, so picking parameters with GridSearchCV gives a measurably better cross-validated accuracy than the weakest combinations tried.
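Passing the chosen values into the pipeline can be done in two ways, sketched below for a current Scikit-Learn. The variable names best_params, clf, and clf_from_dict are made up for this illustration. As far as the library defaults go, max_df=1.0 and norm='l2' are already the defaults of CountVectorizer and TfidfTransformer, so in this particular result alpha is the only value that actually changes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Best combination reported by the grid search in this tutorial.
best_params = {"vect__max_df": 1.0, "tfidf__norm": "l2", "clf__alpha": 1e-05}

# Option 1: pass each value to the constructor of its own step.
clf = Pipeline([
    ("vect", CountVectorizer(max_df=1.0)),
    ("tfidf", TfidfTransformer(norm="l2")),
    ("clf", SGDClassifier(alpha=1e-05)),
])

# Option 2: build the default pipeline, then apply the whole dict at once.
# The 'step__parameter' keys tell set_params which step receives each value.
clf_from_dict = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
]).set_params(**best_params)

print(clf.get_params()["clf__alpha"])
print(clf_from_dict.get_params()["clf__alpha"])
```

Option 2 is handy when the dict comes straight out of a finished search, since nothing has to be copied by hand.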

If you have more powerful hardware, you can uncomment all of the commented-out entries in params. The search then covers many more combinations, which may turn up a higher accuracy.
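Before uncommenting everything, it helps to estimate what the full grid costs. The sketch below counts the candidates in the complete params dict and extrapolates from the run above (80 fits in about 4.5 minutes). The linear extrapolation is an assumption and probably optimistic, because options such as bigrams or more iterations make each individual fit slower.

```python
from functools import reduce
from operator import mul

# Number of values per parameter once every commented-out line is enabled.
values_per_param = {
    "vect__max_df": 2,
    "vect__max_features": 4,
    "vect__ngram_range": 2,
    "tfidf__use_idf": 2,
    "tfidf__norm": 2,
    "clf__alpha": 2,
    "clf__penalty": 2,
    "clf__n_iter": 3,
}
cv_folds = 10  # matches cv=10 in the tutorial

# The grid size is the product of the number of values per parameter.
candidates = reduce(mul, values_per_param.values(), 1)
total_fits = candidates * cv_folds

# Reference point from the run above: 80 fits finished in about 4.5 minutes.
minutes_per_fit = 4.5 / 80
estimated_hours = total_fits * minutes_per_fit / 60

print("%d candidates, %d fits, roughly %.1f hours"
      % (candidates, total_fits, estimated_hours))
```

This prints "768 candidates, 7680 fits, roughly 7.2 hours" on the same machine, so it may be worth enabling the extra parameters a few at a time.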
