(An error encountered in the 'Data preprocessing summary and practice' part of the 'Data preprocessing' category)

# Feature selection and model hyperparameter tuning
from sklearn.ensemble import RandomForestClassifier as RFC
from xgboost import XGBClassifier as XGB
from lightgbm import LGBMClassifier as LGBM
from sklearn.feature_selection import *
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import f1_score

# Design the parameter grid for each model
model_parameter_grid = dict()
model_parameter_grid[RFC] = ParameterGrid({'max_depth':[2, 3, 4],
                                          'n_estimators':[50, 100]})
model_parameter_grid[XGB] = ParameterGrid({'max_depth':[2, 3, 4],
                                          'n_estimators':[50, 100],
                                          'learning_rate':[0.05, 0.1, 0.15, 0.2]})
model_parameter_grid[LGBM] = ParameterGrid({'max_depth':[2, 3, 4],
                                           'n_estimators':[50, 100],
                                           'learning_rate':[0.05, 0.1, 0.15, 0.2]})
# Start tuning
best_score = 0
for k in range(30, 5, -1):
    s_x_train = x_train[pvals.iloc[:k].index]
    s_x_test = x_test[pvals.iloc[:k].index]
    for M in model_parameter_grid.keys():
        for P in model_parameter_grid[M]:
            model = M(**P).fit(s_x_train, y_train)
            pred = model.predict(s_x_test)
            score = f1_score(y_test, pred)
            if score > best_score:
                best_score = score
                best_feature = s_x_train.columns
                best_model = M
                best_parameter = P

print(best_score)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [78], in <cell line: 3>()
      6 for M in model_parameter_grid.keys():
      7     for P in model_parameter_grid[M]:
----> 8         model = M(**P).fit(s_x_train, y_train)
      9         pred = model.predict(s_x_test)
     10         score = f1_score(y_test, pred)

File ~\AppData\Roaming\Python\Python39\site-packages\xgboost\core.py:575, in _deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
    573 for k, arg in zip(sig.parameters, args):
    574     kwargs[k] = arg
--> 575 return f(**kwargs)

File ~\AppData\Roaming\Python\Python39\site-packages\xgboost\sklearn.py:1357, in XGBClassifier.fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1352     expected_classes = np.arange(self.n_classes_)
   1353 if (
   1354     self.classes_.shape != expected_classes.shape
   1355     or not (self.classes_ == expected_classes).all()
   1356 ):
-> 1357     raise ValueError(
   1358         f"Invalid classes inferred from unique values of `y`.  "
   1359         f"Expected: {expected_classes}, got {self.classes_}"
   1360     )
   1362 params = self.get_xgb_params()
   1364 if callable(self.objective):

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1], got [-1  1]

 

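This error occurs because XGBoost requires class labels to start at 0, while the labels here are -1 and 1. According to the Stack Overflow answer cited at the end of this post, this has been required since version 1.3.2.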
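The check can be reproduced outside XGBoost. The following is a minimal sketch that mirrors the comparison visible in the traceback (the sorted unique labels against `np.arange(n_classes)`). The helper name `labels_are_zero_based` and the sample arrays are made up for illustration and are not part of XGBoost.

```python
import numpy as np


def labels_are_zero_based(y):
    """Return True if the unique labels in y are exactly 0, 1, ..., n_classes - 1."""
    classes = np.unique(y)               # sorted unique labels found in y
    expected = np.arange(len(classes))   # same role as expected_classes in the traceback
    return bool(np.array_equal(classes, expected))


# Hypothetical labels in the same -1/1 form that triggered the error
y_example = np.array([-1, 1, 1, -1, 1])

print(np.unique(y_example))                           # the labels actually present
print(labels_are_zero_based(y_example))               # fails the check
print(labels_are_zero_based(np.array([0, 1, 1, 0])))  # passes the check
```

Running a check like this on `y_train` before fitting shows immediately whether the labels need to be re-encoded.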

An easy way to fix this is to use LabelEncoder from the sklearn.preprocessing module.

 

Solution

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)

After re-encoding the labels in y_train (LabelEncoder maps -1 to 0 and 1 to 1) and training again, the error was resolved.
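To see what the encoder does to the labels, here is a small sketch on made-up data. The names `y_example` and `le_example` are placeholders so that they do not clash with the `le` and `y_train` used in this post.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same -1/1 form as in the error message
y_example = np.array([-1, 1, 1, -1, 1])

le_example = LabelEncoder()
y_encoded = le_example.fit_transform(y_example)

# classes_ stores the original labels in sorted order;
# the position of a label in classes_ is its encoded value
print(le_example.classes_)
print(y_encoded)
```

Because the encoded values are positions in `classes_`, the result always starts at 0, which is the form XGBoost expects.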

 

A caveat when evaluating the model afterwards

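The model now predicts the encoded values 0 and 1, but the test labels (y_test) are still in their original form, -1 and 1. The predictions must therefore be decoded with le.inverse_transform so that they match y_test before the score can be computed.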
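The decoding step can be tried on a tiny example first. This is only a sketch with made-up arrays (`y_train_raw`, `y_test_raw`, and `pred_encoded` are hypothetical). It also shows an alternative that is not used in this post: encoding the test labels with `transform` instead of decoding the predictions. Both give the same F1 score.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

y_train_raw = np.array([-1, 1, 1, -1, 1])  # hypothetical original training labels
y_test_raw = np.array([1, -1, 1, 1])       # hypothetical test labels, still -1/1
pred_encoded = np.array([1, 0, 0, 1])      # hypothetical model output in 0/1 form

le_example = LabelEncoder().fit(y_train_raw)

# Option 1 (the approach in this post): decode the predictions back to -1/1
pred_decoded = le_example.inverse_transform(pred_encoded)
score_decoded = f1_score(y_test_raw, pred_decoded)

# Option 2 (alternative): encode the test labels to 0/1 instead
score_encoded = f1_score(le_example.transform(y_test_raw), pred_encoded)

print(pred_decoded)
print(score_decoded, score_encoded)
```

Either way, the point is that the labels and the predictions passed to `f1_score` must be in the same form.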

# Start tuning
best_score = 0
for k in range(30, 5, -1):
    s_x_train = x_train[pvals.iloc[:k].index]
    s_x_test = x_test[pvals.iloc[:k].index]
    for M in model_parameter_grid.keys():
        for P in model_parameter_grid[M]:
            model = M(**P).fit(s_x_train, y_train)
            pred = model.predict(s_x_test)
            pred = le.inverse_transform(pred)
            score = f1_score(y_test, pred)
            if score > best_score:
                best_score = score
                best_feature = s_x_train.columns
                best_model = M
                best_parameter = P

print(best_score)
print('\n')
print(best_feature)
print('\n')
print(best_model)
print('\n')
print(best_parameter)

 

Source: https://stackoverflow.com/questions/71996617/invalid-classes-inferred-from-unique-values-of-y-expected-0-1-2-3-4-5-got
