ValueError while building a cost-sensitive probabilistic model

97dingdong 2022. 12. 31. 21:31

[Study material]

FastCampus, "파이썬을 활용한 데이터 전처리 Level UP 올인원 패키지 Online" (Data Preprocessing with Python: Level UP All-in-One Package, Online).

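An error hit while building a cost-sensitive probabilistic model, at the step that checks how recall and accuracy change with the cut-off value

First, I built a logistic regression model and wrote a function that adjusts the cut-off value.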

# Function that adjusts the cut-off value
def cost_sensitive_model(model, cut_off_value, x_test, y_test):
    probs = model.predict(x_test)
    probs = pd.DataFrame(probs, columns = model.classes_)
    pred_y = 2 * (probs.iloc[:, -1] > cut_off_value) -1
    recall = recall_score(y_test, pred_y)
    accuracy = accuracy_score(y_test, pred_y)
    return recall, accuracy
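
A side note on the expression `2 * (... > cut_off_value) - 1` in the function above: the comparison yields booleans, and multiplying by 2 and subtracting 1 maps False/True to -1/1, which matches the class labels used in this post. A minimal sketch with made-up positive-class probabilities (the helper `apply_cut_off` and the array `positive_probs` are hypothetical names for illustration, not part of the course code):

```python
import numpy as np

def apply_cut_off(positive_probs, cut_off_value):
    # Boolean mask -> {0, 2} after doubling -> {-1, 1} after subtracting 1
    return 2 * (positive_probs > cut_off_value) - 1

# Made-up probabilities of the positive class for four samples
positive_probs = np.array([0.10, 0.40, 0.60, 0.90])

labels_at_half = apply_cut_off(positive_probs, 0.5)  # stricter cut-off
labels_at_low = apply_cut_off(positive_probs, 0.3)   # looser cut-off: more positives
```

Lowering the cut-off turns more samples into the positive class, which is why recall tends to rise as the cut-off falls.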

After that, the error occurred while I was plotting how recall and accuracy change with the cut-off value.

# Check how recall and accuracy change with the cut-off value
from matplotlib import pyplot as plt
import numpy as np

cut_off_value_list = np.linspace(0, 1, 101)
recall_list = []
accuracy_list = []

for c in cut_off_value_list:
    recall, accuracy = cost_sensitive_model(model, c, x_test, y_test)
    recall_list.append(recall)
    accuracy_list.append(accuracy)

%matplotlib inline
plt.plot(cut_off_value_list, recall_list, label = 'recall')
plt.plot(cut_off_value_list, accuracy_list, label = 'accuracy')
plt.legend()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [38], in <cell line: 9>()
      7 accuracy_list = []
      9 for c in cut_off_value_list:
---> 10     recall, accuracy = cost_sensitive_model(model, c, x_test, y_test)
     11     recall_list.append(recall)
     12     accuracy_list.append(accuracy)

Input In [37], in cost_sensitive_model(model, cut_off_value, x_test, y_test)
      2 def cost_sensitive_model(model, cut_off_value, x_test, y_test):
      3     probs = model.predict(x_test)
----> 4     probs = pd.DataFrame(probs, columns = model.classes_)
      5     pred_y = 2 * (probs.iloc[:, -1] > cut_off_value) -1
      6     recall = recall_score(y_test, pred_y)

File ~\anaconda3\lib\site-packages\pandas\core\frame.py:694, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    684         mgr = dict_to_mgr(
    685             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    686             # attribute "name"
   (...)
    691             typ=manager,
    692         )
    693     else:
--> 694         mgr = ndarray_to_mgr(
    695             data,
    696             index,
    697             columns,
    698             dtype=dtype,
    699             copy=copy,
    700             typ=manager,
    701         )
    703 # For data is list-like, or Iterable (will consume into list)
    704 elif is_list_like(data):

File ~\anaconda3\lib\site-packages\pandas\core\internals\construction.py:351, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    346 # _prep_ndarray ensures that values.ndim == 2 at this point
    347 index, columns = _get_axes(
    348     values.shape[0], values.shape[1], index=index, columns=columns
    349 )
--> 351 _check_values_indices_shape_match(values, index, columns)
    353 if typ == "array":
    355     if issubclass(values.dtype.type, str):

File ~\anaconda3\lib\site-packages\pandas\core\internals\construction.py:422, in _check_values_indices_shape_match(values, index, columns)
    420 passed = values.shape
    421 implied = (len(index), len(columns))
--> 422 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (392, 1), indices imply (392, 2)

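Looking closely at the code that defines the function, I saw I had made a mistake...

To use a cut-off value I should have taken the predicted probabilities from predict_proba, not the predicted labels from predict. I wrote the code with predict, so the shapes did not match.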

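predict_proba returns a 2-D array holding the predicted probability of each class, so its shape is (392, 2). predict returns a 1-D array of predicted class labels, here -1 and 1, with shape (392,), which pd.DataFrame treats as a single column of shape (392, 1). Passing columns = model.classes_ asks for two columns, so the DataFrame construction inside the function failed as soon as the plotting loop called it.

In other words, the function used predict and got a 1-D array of labels, while the very next line treated that result as a two-column array of probabilities, so the shapes did not match and the error was raised.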
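
The mismatch can be reproduced without a model at all. The sketch below uses small made-up arrays in place of the real outputs (`classes`, `labels_1d`, and `probs_2d` are hypothetical stand-ins for `model.classes_`, the result of `predict`, and the result of `predict_proba`):

```python
import numpy as np
import pandas as pd

classes = np.array([-1, 1])            # stands in for model.classes_

labels_1d = np.array([1, -1, 1])       # like predict: one label per row, shape (3,)
probs_2d = np.array([[0.2, 0.8],
                     [0.7, 0.3],
                     [0.4, 0.6]])      # like predict_proba: one column per class, shape (3, 2)

# One column of data with two column names: pandas raises ValueError
try:
    pd.DataFrame(labels_1d, columns=classes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Two columns of data with two column names: works
probs_df = pd.DataFrame(probs_2d, columns=classes)
```

Printing `error_message` should show the same kind of "Shape of passed values ... indices imply ..." text as the traceback above, just with 3 rows instead of 392.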

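It is the kind of small slip that happens while writing code, but I spent more than 10 minutes rechecking the code without seeing what was wrong... Glad I found it :)

Corrected code for the function that adjusts the cut-off value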

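import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Function that adjusts the cut-off value
def cost_sensitive_model(model, cut_off_value, x_test, y_test):
    probs = model.predict_proba(x_test)  # class probabilities, shape (n_samples, n_classes)
    probs = pd.DataFrame(probs, columns = model.classes_)
    # Last column = probability of the last class in model.classes_ (here 1); map True/False to 1/-1
    pred_y = 2 * (probs.iloc[:, -1] >= cut_off_value) -1
    recall = recall_score(y_test, pred_y)
    accuracy = accuracy_score(y_test, pred_y)
    return recall, accuracy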

Changing predict to predict_proba on the second line of the function solved the problem~
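
For anyone who wants to try the corrected idea end to end, here is a rough self-contained sketch. It is not the course's code: the data comes from scikit-learn's `make_classification` as a stand-in for the course dataset, the labels are remapped to -1/1 to match this post, and the function uses `np.where` instead of the DataFrame route, looking up the positive-class column from `model.classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); remap labels from {0, 1} to {-1, 1}
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
y = 2 * y - 1
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

def cost_sensitive_model(model, cut_off_value, x_test, y_test):
    probs = model.predict_proba(x_test)        # shape (n_samples, 2)
    pos_col = list(model.classes_).index(1)    # column holding P(class == 1)
    pred_y = np.where(probs[:, pos_col] >= cut_off_value, 1, -1)
    return recall_score(y_test, pred_y), accuracy_score(y_test, pred_y)

# Sweep the cut-off value from 0 to 1 and collect both metrics
cut_off_value_list = np.linspace(0, 1, 101)
results = [cost_sensitive_model(model, c, x_test, y_test) for c in cut_off_value_list]
recall_list = [r for r, _ in results]
accuracy_list = [a for _, a in results]
```

The two lists can be passed to `plt.plot` exactly as in the plotting cell earlier in the post. Looking up the column by label avoids relying on the positive class being the last entry of `model.classes_`.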
