Machine Learning 2024. 1. 31. 14:01

A project from the 2014 Kaggle competition on forecasting bike rental demand in the Capital Bikeshare program in Washington, D.C.

https://www.kaggle.com/competitions/bike-sharing-demand/data

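datetime: hourly date + timestamp
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: 1 = a holiday such as a national holiday (excluding Saturday/Sunday weekends), 0 = not a holiday
workingday: 1 = a weekday that is neither a weekend (Saturday/Sunday) nor a holiday, 0 = weekend or holiday
weather:
- 1 = clear, partly cloudy
- 2 = mist, mist + cloudy
- 3 = light snow, light rain + thunderstorm
- 4 = heavy snow/rain, thunder/lightning
temp: temperature (Celsius)
atemp: "feels like" temperature (Celsius)
humidity: relative humidity
windspeed: wind speed
casual: number of rentals by non-registered users
registered: number of rentals by registered users
count: total number of rentals (the target value)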

Required libraries¶

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore', category=RuntimeWarning)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

Data processing and visualization¶

Data processing¶

In [2]:

# https://www.kaggle.com/competitions/bike-sharing-demand/data
bike_df = pd.read_csv('./datasets/bike_train.csv')
bike_df.shape
bike_df.head()

Out[2]:

(10886, 12)

Out[2]:

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1

In [3]:

bike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB

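There are no null values, and every column except datetime is numeric. The datetime column is of object type and holds strings in 'year-month-day hour:minute:second' format, so it needs processing.
-> Split it into four attributes: year, month, day, and hour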

In [4]:

# Convert the string column to datetime type
bike_df['datetime'] = bike_df.datetime.apply(pd.to_datetime)

# Extract year, month, day, and hour from the datetime values
bike_df['year'] = bike_df.datetime.apply(lambda x : x.year)
bike_df['month'] = bike_df.datetime.apply(lambda x : x.month)
bike_df['day'] = bike_df.datetime.apply(lambda x : x.day)
bike_df['hour'] = bike_df.datetime.apply(lambda x : x.hour)

bike_df.head(3)

Out[4]:

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	year	month	day	hour
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	2011	1	1	0
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	2011	1	1	1
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	2011	1	1	2
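
The same split can also be written with pandas' vectorized `.dt` accessor instead of `apply` with a lambda. The following is a minimal sketch on a toy frame, not the notebook's own code: `toy_df` is an illustrative name, and the third timestamp is a made-up value added only to show a different year and hour.

```python
import pandas as pd

# Toy frame; the first two timestamps mirror the head() output above,
# the third is a made-up value for illustration
toy_df = pd.DataFrame({'datetime': ['2011-01-01 00:00:00',
                                    '2011-01-01 01:00:00',
                                    '2012-12-19 23:00:00']})

# Parse the whole column at once rather than element by element
toy_df['datetime'] = pd.to_datetime(toy_df['datetime'])

# The .dt accessor exposes datetime components as vectorized attributes
toy_df['year'] = toy_df['datetime'].dt.year
toy_df['month'] = toy_df['datetime'].dt.month
toy_df['day'] = toy_df['datetime'].dt.day
toy_df['hour'] = toy_df['datetime'].dt.hour

print(toy_df)
```

On a frame of this size the difference is cosmetic; on larger data the vectorized form usually avoids the per-row Python call that `apply` makes.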

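Since the time has been split into four attributes, drop the original datetime column.
The casual and registered columns hold the number of rentals by non-registered and registered users respectively. Because casual + registered = count, there is no need to keep them separately, so drop these two columns as well.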
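
The identity casual + registered = count can be checked before the columns are dropped. The sketch below uses only the first three rows shown in the earlier head() output; `toy_df` is an illustrative name, and on the real data the same comparison would be run against `bike_df`.

```python
import pandas as pd

# First three rows of casual / registered / count as shown in the head() output
toy_df = pd.DataFrame({
    'casual':     [3, 8, 5],
    'registered': [13, 32, 27],
    'count':      [16, 40, 32],
})

# Element-wise check that the two user groups add up to the total
is_sum = (toy_df['casual'] + toy_df['registered'] == toy_df['count']).all()
print(is_sum)
```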

In [5]:

drop_cols = ['datetime', 'casual', 'registered']
bike_df.drop(drop_cols, axis=1, inplace=True)
bike_df.head(3)

Out[5]:

	season	weather	temp	atemp	humidity	count	year	month	day	hour
0	1	1	9.84	14.395	81	16	2011	1	1	0
1	1	1	9.02	13.635	80	40	2011	1	1	1
2	1	1	9.02	13.635	80	32	2011	1	1	2

Data visualization¶

Check the distribution of the target value count (number of rentals) for each major column

In [6]:

# Bar plots
fig, axs = plt.subplots(figsize=(16, 12), ncols=4, nrows=2)
plt.tight_layout()

cat_features = ['year', 'month', 'season', 'weather', 'day', 'hour', 'holiday', 'workingday']

# For every column in cat_features, plot count per column value with barplot
for i, feature in enumerate(cat_features):
    row = i // 4
    col = i % 4
    # seaborn's barplot shows the mean of count for each column value (its default estimator)
    sns.barplot(x=feature, y='count', data=bike_df, ax=axs[row][col])

[Figure: bar plots of count by year, month, season, weather, day, hour, holiday, and workingday]

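Looking at count by year, 2012 is higher than 2011. Rather than carrying any special meaning, this appears to reflect bike rentals steadily increasing over time.
By month, January through March are low and June through September are high. By season, spring (1) and winter (4) are low, while summer (2) and fall (3) are high.
By weather, counts are low when there is snow or rain (3, 4) and high when it is clear or slightly misty (1, 2).
By hour, the morning commute (8) and the evening commute (17, 18) are relatively high. Differences between days are small.
For holiday and workingday, counts are slightly higher on working days (that is, holiday 0, workingday 1).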
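
Note that seaborn's barplot draws the mean of y for each x value by default, so the bar heights in the plots above are per-category averages rather than totals. The numbers behind such a plot can be read off with groupby. A minimal sketch on made-up data (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data (made-up values) standing in for bike_df
toy_df = pd.DataFrame({
    'season': [1, 1, 2, 2, 3, 3],
    'count':  [10, 30, 100, 140, 200, 260],
})

# Per-category mean: what barplot shows by default
season_mean = toy_df.groupby('season')['count'].mean()

# Per-category sum: a different quantity, shown here only for contrast
season_sum = toy_df.groupby('season')['count'].sum()

print(season_mean)
print(season_sum)
```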

In [7]:

# Box plots
fig, axs = plt.subplots(figsize=(12, 8), nrows=2, ncols=2) # 2 rows, 2 columns
boxplot_features = ['season', 'weather', 'holiday', 'workingday']

for i, feature in enumerate(boxplot_features):
    row = i // 2
    col = i % 2
    # Use seaborn's boxplot to show the distribution of count for each column value
    sns.boxplot(x=feature, y='count', data=bike_df, ax=axs[row][col])

plt.tight_layout()
plt.show()

[Figure: box plots of count by season, weather, holiday, and workingday]

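Applying various regression models to the dataset and measuring prediction performance¶

The evaluation metric required by Kaggle is RMSLE (Root Mean Squared Logarithmic Error), that is, RMSE computed on the log-transformed values.

scikit-learn has not traditionally provided RMSLE as a ready-made function, so we write our own evaluation function that computes it.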
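
Written out, with $p_i$ the predicted count, $a_i$ the actual count, and $n$ the number of samples (symbols chosen here for illustration), the metric is:

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
```

Because the error is taken between logarithms, the metric penalizes relative differences more than absolute ones: missing a count of 10 by 5 costs more than missing a count of 500 by 5.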

In [8]:

# Use log1p() rather than log() to compute RMSLE, because log() can produce NaN and similar issues
def rmsle(y, pred):
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    squared_error = (log_y - log_pred) ** 2
    rmsle_val = np.sqrt(np.mean(squared_error))
    return rmsle_val

# Compute RMSE using scikit-learn's mean_squared_error()
def rmse(y, pred):
    return np.sqrt(mean_squared_error(y, pred))

# Compute MAE, RMSE, and RMSLE together
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    # MAE (mean absolute error) is computed with scikit-learn's mean_absolute_error()
    mae_val = mean_absolute_error(y, pred)
    print('RMSLE: {0:.3f}, RMSE: {1:.3f}, MAE: {2:.3f}'.format(rmsle_val, rmse_val, mae_val))

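A point to watch when writing the rmsle() function above:
RMSLE could also be computed with NumPy's log() function or scikit-learn's mean_squared_log_error(), but depending on the magnitude of the data values these can easily run into overflow/underflow errors.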

In [9]:

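# An rmsle implementation like the following is prone to overflow/underflow errors, so use it with care
from sklearn.metrics import mean_squared_log_error

def rmsle_error(y, pred):
    msle = mean_squared_log_error(y, pred)
    rmsle_error = np.sqrt(msle)
    return rmsle_error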

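Use log1p() rather than log(): log1p(x) computes log(1 + x), adding 1 to the input before taking the log, which resolves this problem.
Values transformed with log1p() can easily be restored to their original scale with NumPy's expm1() function.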
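
The round trip between log1p() and expm1() can be seen on a handful of values. A minimal sketch; the numbers in `counts` are arbitrary illustrative values, including a 0 to show why the +1 matters.

```python
import numpy as np

# Arbitrary illustrative counts, including 0
counts = np.array([0, 1, 16, 40, 977])

# log1p(x) is log(1 + x): a count of 0 maps to 0.0 instead of -inf
log_counts = np.log1p(counts)

# expm1(x) is exp(x) - 1, the inverse of log1p
restored = np.expm1(log_counts)

print(log_counts)
print(restored)
```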

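Log transformation, feature encoding, and model training / prediction / evaluation¶

Check whether the target values follow a normal distribution, and for regression models apply one-hot encoding to the categorical features.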

In [10]:

y_target = bike_df['count']
X_features = bike_df.drop(['count'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.3, random_state=0)

lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)
pred = lr_reg.predict(X_test)

evaluate_regr(y_test, pred)

Out[10]:

LinearRegression()

RMSLE: 1.165, RMSE: 140.900, MAE: 105.924

Given the scale of the actual target values (count), this is a fairly large prediction error, so check how far the predictions are from the actual values, starting with the largest errors.

In [11]:

def get_top_error_data(y_test, pred, n_tops=5):
    # Build a DataFrame with the actual rental count and the predicted value side by side
    result_df = pd.DataFrame(y_test.values, columns=['real_count'])
    result_df['predicted_count'] = np.round(pred)
    result_df['diff'] = np.abs(result_df['real_count'] - result_df['predicted_count'])

    # Print the rows with the largest difference between predicted and actual values
    print(result_df.sort_values('diff', ascending=False)[:n_tops])

get_top_error_data(y_test, pred, n_tops=5)

      real_count  predicted_count   diff
1618         890            322.0  568.0
3151         798            241.0  557.0
966          884            327.0  557.0
412          745            194.0  551.0
2817         856            310.0  546.0

In [12]:

# Check whether the distribution of the target values is skewed
sns.displot(y_target)

[Figure: distribution plot of count]

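The most common way to bring skewed values closer to a normal distribution is to apply a log transformation.
Train on the transformed target values, then apply expm1() to the predictions to restore them to the original scale.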
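
The effect of the log transformation on skewness can be quantified with pandas' `Series.skew()`. The sketch below uses made-up right-skewed values, not the actual count column; on the real data the same calls would be applied to `y_target`.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed counts: many small values and a few very large ones
counts = pd.Series([1, 2, 3, 5, 8, 13, 40, 120, 400, 977])

# Sample skewness before and after the log1p transformation
skew_raw = counts.skew()
log_counts = np.log1p(counts)
skew_log = log_counts.skew()

# expm1 brings the transformed values back to the original scale
restored = np.expm1(log_counts)

print('skew before:', skew_raw)
print('skew after :', skew_log)
```

A skewness closer to 0 indicates a more symmetric distribution, which is what the transformation aims for here.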

In [13]:

y_log_transform = np.log1p(y_target)
sns.displot(y_log_transform)

[Figure: distribution plot of the log1p-transformed count]

In [14]:

# Log-transform the target column count with log1p
y_target_log = np.log1p(y_target)

# Split into train / test sets using the log-transformed y_target_log
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target_log, test_size=0.3, random_state=0)

lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)
pred = lr_reg.predict(X_test)

# The test set target values are log-transformed, so convert them back to the original scale with expm1
y_test_exp = np.expm1(y_test)

# The predictions come from a model trained on the log-transformed target, so convert their scale as well
pred_exp = np.expm1(pred)

evaluate_regr(y_test_exp, pred_exp)

Out[14]:

LinearRegression()

RMSLE: 1.017, RMSE: 162.594, MAE: 109.286

The RMSLE decreased, but the RMSE actually increased. Before encoding the individual features, visualize the regression coefficient of each feature.

In [15]:

coef = pd.Series(lr_reg.coef_, index=X_features.columns)
coef_sort = coef.sort_values(ascending=False)

sns.barplot(x=coef_sort.values, y=coef_sort.index)

[Figure: bar plot of linear regression coefficients by feature]

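The year, hour, month, season, holiday, and workingday features have relatively large regression coefficients.
Looking at these features, year takes the values 2011 and 2012 and month takes 1 through 12, so they carry their meaning as numbers.
- For these features, however, the magnitude of the individual numbers is not what matters.
- year simply denotes the calendar year, so 2012 should not be treated as a larger value than 2011.

-> year, hour, month, and the like are written as numbers, but they are all categorical features

scikit-learn has no data type dedicated to categories, so everything must be converted to numbers.

However, when numeric category codes are used in linear regression, the fitted coefficients are strongly influenced by the magnitude of those codes.

-> Apply one-hot encoding to these features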

In [16]:

# One-hot encode 'year', 'month', 'day', 'hour', and the other categorical features
X_features_ohe = pd.get_dummies(X_features, columns=['year', 'month', 'day', 'hour', 'holiday', 'workingday', 'season', 'weather'])
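
To see what get_dummies does to a single categorical column, here is a minimal sketch on a toy frame. `toy_df`, `toy_ohe`, and the values are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy frame (made-up values): one categorical column stored as integers, one numeric column
toy_df = pd.DataFrame({'year': [2011, 2012, 2011],
                       'temp': [9.84, 9.02, 14.76]})

# Each distinct value of 'year' becomes its own indicator column; 'temp' is left untouched
toy_ohe = pd.get_dummies(toy_df, columns=['year'])

print(toy_ohe)
```

After encoding, a linear model fits a separate coefficient for year_2011 and year_2012 instead of a single coefficient multiplied by the raw values 2011 and 2012.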

In [17]:

# Split into train / test sets based on the one-hot encoded feature dataset
X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target_log, test_size=0.3, random_state=0)

# Given a model and the train / test datasets, fit, predict, and print the evaluation metrics
def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1:
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###', model.__class__.__name__,'###')
    evaluate_regr(y_test, pred)

# Evaluate each model
lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha=0.01)

for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=True)

### LinearRegression ###
RMSLE: 0.590, RMSE: 97.688, MAE: 63.382
### Ridge ###
RMSLE: 0.590, RMSE: 98.529, MAE: 63.893
### Lasso ###
RMSLE: 0.635, RMSE: 113.219, MAE: 72.803

After applying one-hot encoding, the prediction performance of linear regression improved: RMSLE dropped from 1.017 to 0.590.

In [18]:

# Visualize the features with the largest regression coefficients in the encoded dataset; one-hot encoding increased the number of features, so take the top 20
coef = pd.Series(lr_reg.coef_, index=X_features_ohe.columns)
coef_sort = coef.sort_values(ascending=False)[:20]

sns.barplot(x=coef_sort.values, y=coef_sort.index)

[Figure: bar plot of the top 20 regression coefficients after one-hot encoding]

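One-hot encoding changed the relative influence of the features, and model performance improved as well.
This is not always the case, but for linear regression, converting important categorical features with one-hot encoding can have a significant effect on performance.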

In [19]:

# Evaluate each of the Random Forest, GBM, XGBoost, and LightGBM models
rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)
xgb_reg = XGBRegressor(n_estimators=500)
lgbm_reg = LGBMRegressor(n_estimators=500)

models = [rf_reg, gbm_reg, xgb_reg, lgbm_reg]

for model in models:
    # Depending on the version, XGBoost may raise an error when given a DataFrame, so convert to ndarray
    get_model_predict(model, X_train.values, X_test.values, y_train.values, y_test.values, is_expm1=True)

### RandomForestRegressor ###
RMSLE: 0.355, RMSE: 50.460, MAE: 31.224
### GradientBoostingRegressor ###
RMSLE: 0.330, RMSE: 53.349, MAE: 32.747
### XGBRegressor ###
RMSLE: 0.342, RMSE: 51.732, MAE: 31.251
### LGBMRegressor ###
RMSLE: 0.319, RMSE: 47.215, MAE: 29.029

Prediction performance improved over the linear regression models above, but this does not mean that regression trees perform better than linear regression in general.
