[OSS] Machine Learning with scikit-learn

Uykm 2023. 11. 12. 22:34

▼ Machine Learning

Machine learning is a data-driven approach based on induction. (↔ Deduction)
vs. Deep (neural network) learning
➜ In cases where little data is available, classical 'machine learning' is still used.

ML procedure (~ trial and error)
1. Data acquisition
2. Data preprocessing (e.g. labeling)
3. Feature selection and extraction
4. Model and cost function selection (or design)
5. Hyperparameter selection (e.g. optimizer, learning rate)
6. Model training
7. Model testing
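
The procedure above can be sketched with scikit-learn roughly as follows. This is only an illustration, not the lecture's own code: the choice of the petal features, the 70/30 split, the StandardScaler step, and the SVC hyperparameter values are all assumptions made for the example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data acquisition
iris = datasets.load_iris()

# 2-3. Preprocessing and feature selection (assumption: keep only the two petal features)
X = iris.data[:, 2:4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)  # Fit the scaling on training data only

# 4-5. Model and hyperparameter selection (values chosen arbitrarily for the sketch)
model = svm.SVC(kernel='rbf', C=1.0)

# 6. Model training
model.fit(scaler.transform(X_train), y_train)

# 7. Model testing on data the model has not seen
test_accuracy = model.score(scaler.transform(X_test), y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
```

In practice steps 3 to 7 are repeated with different choices, which is the "trial and error" part.
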
ML approaches
- Supervised learning
  ➜ Both the inputs and the correct outputs (desired targets; the answers) are given.
  (-/+) Targets are hard to obtain, but learning tends to be faster.
- Unsupervised learning
  ➜ Only the inputs are given.
  (+/-) No targets need to be provided, but the problem is more likely to be solved poorly.
- Reinforcement learning
  ➜ Learning through feedback (reward/penalty).
Types of ML problems by target (objective)
- Classification
  ➜ Assigning the given data to categories. (discrete)
- Regression
  ➜ Finding the parameters that fit the given data. (continuous)
  ex) Line fitting
- Clustering
  ➜ Grouping similar data into sets.
- Dimensionality reduction
  ➜ Lowering the dimension of the given data.
- Model selection
  ➜ Finding the parameters that fit the given data, and also the type of model.
- Reinforcement learning
  ➜ Finding the policy that maximizes the cumulative reward through feedback.
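
A few of the problem types above can be tried in a handful of lines. The sketch below is an illustration under assumptions: the synthetic line y = 2x + 1 with small Gaussian noise, and the choice of LinearRegression, KMeans, and PCA, are picked for the example and are not taken from the lecture.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Regression (line fitting): recover a and b of y = a*x + b from noisy samples
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)
line = LinearRegression().fit(x.reshape(-1, 1), y)
print(line.coef_[0], line.intercept_)  # Close to 2 and 1

iris = datasets.load_iris()

# Clustering: group the samples into 3 sets without using iris.target
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Dimensionality reduction: project the 4 features onto 2 components
reduced = PCA(n_components=2).fit_transform(iris.data)
print(reduced.shape)  # (150, 2)
```
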

▼ scikit-learn

One of the 'machine learning' libraries used in Python.

Iris flower dataset

Loading the Iris flower dataset

from sklearn import datasets

iris = datasets.load_iris()

print(iris.target_names) 	# ['setosa' 'versicolor' 'virginica']
print(iris.feature_names) 	# ['sepal length (cm)', ...]
print(iris.data.shape) 		# (150, 4)
print(iris.target.shape) 	# (150,)

How to use (3 steps)
1. Instantiation
2. Training (fit: the training function)
3. Testing (predict)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import (datasets, svm)
from matplotlib.lines import Line2D # For the custom legend

# Load a dataset
iris = datasets.load_iris()

# Train a model
model = svm.SVC() 					# Accuracy: 0.973 (146/150)
model.fit(iris.data, iris.target) 	# Try 'iris.data[:,0:2]' (Accuracy: 0.820)

# Test the model
predict = model.predict(iris.data) 	# Try 'iris.data[:,0:2]' (Accuracy: 0.820)
n_correct = sum(predict == iris.target)
accuracy = n_correct / len(iris.data)

# Visualize testing results
cmap = np.array([(1, 0, 0), (0, 1, 0), (0, 0, 1)])
clabel = [Line2D([0], [0], marker='o', lw=0, label=iris.target_names[i], color=cmap[i]) for i in range(len(cmap))]
for (x, y) in [(0, 1), (2, 3)]:
	plt.figure()
	plt.title(f'svm.SVC ({n_correct}/{len(iris.data)}={accuracy:.3f})')
	plt.scatter(iris.data[:,x], iris.data[:,y], c=cmap[iris.target], edgecolors=cmap[predict])
	plt.xlabel(iris.feature_names[x])
	plt.ylabel(iris.feature_names[y])
	plt.legend(handles=clabel, framealpha=0.5)
plt.show()

▼ Classification

Binary classification

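A typical example is spam filtering: deciding whether a given e-mail is spam or not.
Accuracy: # of correct / # of data
➜ When the classes have different numbers of samples, as below, the accuracy depends on which value is always returned.
➜ This can be addressed by balancing the class ratio or by giving more weight to the smaller class!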

def is_spam_always_true(text):
	return True # Accuracy: 0.1 (10/100)

def is_spam_always_false(text):
	return False # Accuracy: 0.9 (90/100)
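
The effect of the imbalance can be checked with scikit-learn's metrics. This sketch assumes the same 90 normal / 10 spam split as in the comments above and imitates the always-false classifier with a constant prediction list.

```python
from sklearn import metrics

y_true = [False] * 90 + [True] * 10  # 90 normal mails, 10 spam mails
y_pred = [False] * 100               # Same output as is_spam_always_false

accuracy = metrics.accuracy_score(y_true, y_pred)
balanced = metrics.balanced_accuracy_score(y_true, y_pred)
print(accuracy)  # 0.9
print(balanced)  # 0.5
```

Balanced accuracy averages the per-class recall, so a classifier that ignores the minority class scores no better than chance. For weighting the smaller class during training, estimators such as svm.SVC accept class_weight='balanced'.
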

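Confusion matrix
➜ A two-dimensional view of whether each class was predicted correctly or not.
Actual class (whether the mail is really spam) / Predicted class (whether the model predicted spam)
- Accuracy : of all mails, spam (P) and normal (N), how many were classified correctly ➜ (TP + TN) / (P + N)
- Balanced accuracy : average of how much of the spam was classified correctly and how much of the normal mail was classified correctly ➜ (TPR + TNR) / 2
- Recall : of the actual positives, how many were predicted as positive ➜ TP / (TP + FN)
- Precision : of those predicted as positive, how many are actually positive ➜ TP / (TP + FP)
- F1-measure : harmonic mean of recall and precision ➜ 2 * (Precision * Recall) / (Precision + Recall)
- False positive rate : of the actual negatives, how many were wrongly predicted as positive ➜ FP / (FP + TN) (higher is worse)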
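
The definitions above can be worked through by hand. The counts below are assumed for illustration: 10 spam mails of which only 1 is detected, and 90 normal mails that are all classified correctly.

```python
# Assumed counts: spam is the positive class
tp, fn = 1, 9    # Spam predicted as spam / spam predicted as normal
tn, fp = 90, 0   # Normal predicted as normal / normal predicted as spam

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 0.91
recall = tp / (tp + fn)                              # 0.1 (true positive rate)
specificity = tn / (tn + fp)                         # 1.0 (true negative rate)
balanced_accuracy = (recall + specificity) / 2       # 0.55
precision = tp / (tp + fp)                           # 1.0
f1 = 2 * precision * recall / (precision + recall)   # about 0.18
fpr = fp / (fp + tn)                                 # 0.0

print(accuracy, balanced_accuracy, precision, recall, f1, fpr)
```
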

ROC curve (Receiver Operating Characteristic curve)
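
A ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied. The sketch below uses metrics.roc_curve on four made-up scores. The labels and scores are assumptions for illustration and are not from the lecture.

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]             # Hypothetical true labels
y_score = [0.1, 0.4, 0.35, 0.8]   # Hypothetical classifier scores (higher = more positive)

# Each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc = metrics.roc_auc_score(y_true, y_score)

print(fpr)  # [0.  0.  0.5 0.5 1. ]
print(tpr)  # [0.  0.5 0.5 1.  1. ]
print(auc)  # 0.75
```

The area under the curve (AUC) summarizes the curve in one number: 1.0 for a perfect ranking and 0.5 for a random one.
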

from sklearn import metrics

y_true = [False] * 90 + [True] * 10 # True labels
y_pred = [False] * 99 + [True] * 1 # Predicted labels

print(metrics.accuracy_score(y_true, y_pred)) 					# 0.91
print(metrics.balanced_accuracy_score(y_true, y_pred)) 			# 0.55
print(metrics.precision_score(y_true, y_pred)) 					# 1.0 = 1 / 1
print(metrics.recall_score(y_true, y_pred)) 					# 0.1 = 1 / 10
print(metrics.f1_score(y_true, y_pred)) 						# 0.18 = 2 * (1.0*0.1) / (1.0+0.1)
print(metrics.precision_recall_fscore_support(y_true, y_pred)) 	# ...
print(metrics.classification_report(y_true, y_pred))

conf_matx = metrics.confusion_matrix(y_true, y_pred) 			# np.array([[90, 0], [9, 1]], dtype=int64)
																# Rows are actual classes and columns are predicted classes, so the axes may appear swapped.
conf_disp = metrics.ConfusionMatrixDisplay(conf_matx, display_labels=['false', 'true'])
conf_disp.plot()

Multiclass classification
➜ The results show that 4 is often judged as 9, 3 as 5, and 6 as 5. (the parts where classification did poorly)
➜ Both the parts that went well and the parts that went poorly can be checked.
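
A multiclass confusion matrix of this kind can be produced as follows. This is a sketch under assumptions: it uses scikit-learn's bundled digits dataset, a default svm.SVC, and a 50/50 split, which may differ from the setup behind the results described above, so the specific confusions it shows may differ too.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

# Load 8x8 handwritten digit images (10 classes: 0-9)
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0, stratify=digits.target)

model = svm.SVC().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Row i, column j counts test samples of digit i that were predicted as digit j
conf_matx = metrics.confusion_matrix(y_test, y_pred)
print(conf_matx)
```

The diagonal holds the correct predictions, and every off-diagonal entry is one specific kind of mistake.
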

▼ Regression

▼ Clustering

▼ Machine Learning FAQ

Getting good results from learning takes many trials (more data / computing power), but intuition and experience also matter.
- Good hyperparameter values?
  ➜ Try every combination with a grid search, or rely on intuition and experience.
- Good model? Enough training?
  ➜ If the model is too light or too complex for the given data, the problems below arise.
  ➜ Underfitting vs. Overfitting
  ➜ How do we choose an appropriate model?
  ➜ Split the data into training data (used to go from nothing to a trained model), validation data (used to check the model), and test data (held out to measure performance), and choose based on the results!
- Enough data?
  ➜ More data is always good, but the dimension of the data and the number of model parameters must be taken into account.
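
The grid search mentioned above can be sketched with GridSearchCV. The grid values for C and gamma and the choice of svm.SVC on the Iris dataset are assumptions made for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Hypothetical grid: every combination of C and gamma is tried (3 x 3 = 9 candidates)
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# Each candidate is scored with 5-fold cross-validation
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(search.best_score_)
```

The cost grows with the product of the grid sizes, which is why intuition and experience are still useful for narrowing the ranges.
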

Underfitting (need to train longer!) vs. Overfitting (need to use a lighter model!)
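
One way to spot the two cases is to compare the score on the training data with the score on held-out validation data while changing the model complexity. The sketch below is an assumption-laden illustration: it uses a decision tree on Iris and varies max_depth, neither of which comes from the lecture.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

# max_depth=1 is too light for 3 classes; max_depth=None lets the tree grow fully
scores = {}
for depth in (1, 3, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(depth, scores[depth])  # (training accuracy, validation accuracy)
```

A low score on both sets suggests underfitting. A training score well above the validation score suggests overfitting. On a dataset as easy as Iris the gap may be small.
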

Cross-validation (CV)
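➜ How well does the trained model generalize?
➜ That is, in academic research the results differ depending on how training and validation are done, even with the same data, so cross-validation is used to obtain reproducible results!
- Exhaustive cross-validation
  ➜ Checking every possible case.
  C(n,p) (leave-p-out) ➜ Used only occasionally, and only when p is 1 (leave-one-out).
- Non-exhaustive cross-validation
  ➜ Validating with only about k splits of the whole data instead of every case. (k-fold cross-validation)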
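
Both variants are available in sklearn.model_selection. The sketch below assumes svm.SVC on the Iris dataset and k = 5. These choices are for illustration only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

iris = datasets.load_iris()
model = svm.SVC()

# Non-exhaustive: k-fold with k = 5 (shuffled, because Iris is sorted by class)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, iris.data, iris.target, cv=kfold)
print(len(kfold_scores), kfold_scores.mean())  # 5 scores

# Exhaustive with p = 1: leave-one-out trains n = 150 models
loo_scores = cross_val_score(model, iris.data, iris.target, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())      # 150 scores
```

Fixing random_state makes the k-fold split repeatable, which helps when results need to be reproduced.
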