Loading the Data

In [1]:

# Libraries
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# https://www.kaggle.com/c/titanic/data
titanic_df = pd.read_csv('titanic_train.csv')
titanic_df.head(3)

Out[1]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

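PassengerId: sequential ID of the passenger record
Survived: survival status, 0 = died, 1 = survived
Pclass: ticket class, 1 = first class, 2 = second class, 3 = third class
Sex: passenger's sex; Name: passenger's name; Age: passenger's age
SibSp: number of siblings or spouses traveling with the passenger
Parch: number of parents or children traveling with the passenger
Ticket: ticket number
Fare: fare paid
Cabin: cabin number
Embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton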

Data Preprocessing

Handling Missing Values

In [2]:

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [3]:

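# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('N')
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('N')
print('Number of nulls in the dataset: ', titanic_df.isnull().sum().sum())

Number of nulls in the dataset:  0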
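As a side note, the same three fills can be written as a single `DataFrame.fillna` call with a per-column mapping. The sketch below uses a small hypothetical frame (`toy`) in place of `titanic_df`; the values in it are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are illustrative)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})

# One fillna call: each column gets its own replacement value
filled = toy.fillna({'Age': toy['Age'].mean(), 'Cabin': 'N', 'Embarked': 'N'})

print(filled)
print('Remaining nulls:', int(filled.isnull().sum().sum()))
```

Because `fillna` returns a new frame here, the original `toy` is left untouched, which makes it easy to compare before and after.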

Checking the Remaining String Columns

In [4]:

print('Sex value distribution:\n', titanic_df['Sex'].value_counts())
print('\n Cabin value distribution:\n', titanic_df['Cabin'].value_counts())
print('\n Embarked value distribution:\n', titanic_df['Embarked'].value_counts())

Sex value distribution:
 Sex
male      577
female    314
Name: count, dtype: int64

 Cabin value distribution:
 Cabin
N              687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: count, Length: 148, dtype: int64

 Embarked value distribution:
 Embarked
S    644
C    168
Q     77
N      2
Name: count, dtype: int64

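For Cabin, N (the placeholder used for missing values) is by far the most frequent value with 687 rows, and some values list several cabins at once, such as 'C23 C25 C27', which appears 4 times.
The first letter of the cabin number, which seems to indicate the cabin's class, is likely the informative part, so only the first character of Cabin is kept.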

In [5]:

titanic_df['Cabin'] = titanic_df['Cabin'].str[:1]
titanic_df['Cabin'].head(3)

Out[5]:

0    N
1    C
2    N
Name: Cabin, dtype: object

Analyzing Which Passenger Types Were More Likely to Survive

Survival Rate by Sex

In [6]:

titanic_df.groupby(['Sex', 'Survived'])['Survived'].count()

Out[6]:

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

In [7]:

sns.barplot(x='Sex', y='Survived', data=titanic_df)

Out[7]:

<Axes: xlabel='Sex', ylabel='Survived'>

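Men clearly outnumber women among the passengers (577 versus 314), yet the survival rate is far higher for women: 233 of 314 women survived (about 74%), compared with 109 of 577 men (about 19%).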
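The survival rates can be recomputed directly from the counts in Out[6]. The sketch below rebuilds that count Series by hand (only the four numbers are taken from the output above; the variable names are illustrative) and divides survivors by the total for each sex. Since `sns.barplot` uses the mean as its default estimator, the bar heights in the plot correspond to these rates.

```python
import pandas as pd

# Counts copied from the Out[6] groupby result
counts = pd.Series(
    [81, 233, 468, 109],
    index=pd.MultiIndex.from_tuples(
        [('female', 0), ('female', 1), ('male', 0), ('male', 1)],
        names=['Sex', 'Survived'],
    ),
)

# Total passengers per sex, then survivors divided by that total
totals = counts.groupby(level='Sex').sum()
survival_rate = counts.xs(1, level='Survived') / totals

print(totals)
print(survival_rate.round(4))
```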

Survival Rate by Passenger Class and Sex

In [8]:

sns.barplot(x='Pclass', y='Survived', hue='Sex', data=titanic_df)

Out[8]:

<Axes: xlabel='Pclass', ylabel='Survived'>

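For women, the survival probability differs little between first and second class but drops noticeably in third class.
For men, the survival probability in first class is much higher than in second or third class.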

Survival Rate by Age

In [9]:

# Return an age category for the given age; used with apply() on the Age column
def get_category(age):
    cat = ''
    if age <= -1: cat = 'Unknown'
    elif age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else: cat = 'Elderly'

    return cat

# Make the figure for the bar chart larger
plt.figure(figsize=(10, 6))

# Category order used to lay out the x-axis
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

# Map each 'Age' value to its category with get_category()
titanic_df['Age_cat'] = titanic_df['Age'].apply(get_category)
sns.barplot(x='Age_cat', y='Survived', hue='Sex', data=titanic_df, order=group_names)
titanic_df.drop('Age_cat', axis=1, inplace=True)

Out[9]:

<Figure size 1000x600 with 0 Axes>

Out[9]:

<Axes: xlabel='Age_cat', ylabel='Survived'>

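Among women, the Child group has a lower survival probability than the other age groups, while the Elderly group has a very high one.
Among men, the Baby and Child groups have the highest survival probability, and the older age groups show low survival rates.
Taken together, the plots suggest that sex, age, and passenger class are important factors for survival.

This is consistent with the commonly cited practice of rescuing women, children, and the elderly or infirm first in accidents at sea.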
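The age bucketing done by `get_category()` can also be sketched with `pd.cut`, which is not what the notebook uses but expresses the same boundaries as a list of bin edges. The ages below are hypothetical sample values. One difference to keep in mind: a missing age falls through to 'Elderly' in `get_category()`, whereas `pd.cut` leaves it as NaN.

```python
import numpy as np
import pandas as pd

# Bin edges mirroring the thresholds in get_category(); right=True makes each
# interval closed on the right, matching the `<=` comparisons
bins = [-np.inf, -1, 5, 12, 18, 25, 35, 60, np.inf]
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
               'Young Adult', 'Adult', 'Elderly']

# Hypothetical ages chosen to sit on or near the boundaries
ages = pd.Series([0.42, 5, 12, 18.5, 25, 35, 60, 80])
age_cat = pd.cut(ages, bins=bins, labels=group_names, right=True)

print(age_cat.tolist())
```

The result is an ordered categorical, so the category order is carried by the column itself rather than by a separate `order=` list.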

Converting Strings to Numeric Values

In [10]:

from sklearn.preprocessing import LabelEncoder

def encode_features(dataDF):
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(dataDF[feature])
        dataDF[feature] = le.transform(dataDF[feature])
        
    return dataDF

titanic_df = encode_features(titanic_df)
titanic_df.head()

Out[10]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	7	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	2	0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	7	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	2	3
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	7	3

The Sex, Cabin, and Embarked columns have been converted to numeric values.
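To see where codes such as Embarked = 3 for 'S' come from, the sketch below fits a `LabelEncoder` on a short hypothetical list of port codes. `LabelEncoder` sorts the distinct values and numbers them from 0, so the integer assigned to a value depends on which other values are present when the encoder is fitted.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Embarked values, including the 'N' placeholder
le = LabelEncoder()
codes = le.fit_transform(['S', 'C', 'S', 'Q', 'N'])

print(list(le.classes_))                       # sorted distinct values
print(codes.tolist())                          # integer code for each input
print(le.inverse_transform([3, 0]).tolist())   # map codes back to labels
```

The codes are only identifiers; their numeric order carries no meaning, which may matter for models that treat inputs as magnitudes.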

Wrapping the Preprocessing Steps in Functions

In [11]:

# Handle null values
def fillna(df):
    # Assign back rather than using fillna(inplace=True) on a selected column,
    # which newer pandas versions warn about
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df

# Drop features that the machine learning algorithms do not need
def drop_features(df):
    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
    return df

# Keep the first letter of Cabin, then label-encode the categorical features
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

# Run the preprocessing functions defined above in order
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df

Preparing the Data

In [12]:

# Reload the original data, then extract the feature set and the label set
titanic_df = pd.read_csv('titanic_train.csv')

y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived', axis=1)

X_titanic_df = transform_features(X_titanic_df)  # apply the preprocessing function defined above

Splitting into Training and Test Sets

In [13]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df, test_size=0.2, random_state=11)
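For a sense of what `test_size=0.2` yields on 891 rows, the sketch below splits a placeholder feature array of the same length. The labels are 549 zeros and 342 ones, which are the class totals implied by the earlier groupby counts; the row order is arbitrary. The `stratify=y` argument is an addition that the notebook does not use, shown here as one way to keep the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features with the same number of rows as the Titanic training data
X = np.arange(891).reshape(-1, 1)
# 549 non-survivors and 342 survivors, per the counts shown earlier
y = np.array([0] * 549 + [1] * 342)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y  # stratify is an assumption, not in the notebook
)

print(len(X_train), len(X_test))
print(round(float(y_train.mean()), 3), round(float(y_test.mean()), 3))
```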

Creating scikit-learn Classifier Instances

In [14]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create scikit-learn classifier instances for a decision tree, a random forest, and logistic regression

dt_clf = DecisionTreeClassifier(random_state=11)
rf_clf = RandomForestClassifier(random_state=11)

# Use liblinear as the solver for logistic regression; it tends to perform slightly better for binary classification on small datasets
lr_clf = LogisticRegression(solver='liblinear')

Model Training / Prediction / Evaluation

In [15]:

# DecisionTreeClassifier
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
print('DecisionTreeClassifier accuracy: {0:.4f}'.format(accuracy_score(y_test, dt_pred)))

# RandomForestClassifier
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print('RandomForestClassifier accuracy: {0:.4f}'.format(accuracy_score(y_test, rf_pred)))

# LogisticRegression
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)
print('LogisticRegression accuracy: {0:.4f}'.format(accuracy_score(y_test, lr_pred)))

Out[15]:

DecisionTreeClassifier(random_state=11)

DecisionTreeClassifier accuracy: 0.7877

Out[15]:

RandomForestClassifier(random_state=11)

RandomForestClassifier accuracy: 0.8547

Out[15]:

LogisticRegression(solver='liblinear')

LogisticRegression accuracy: 0.8659
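The three fit / predict / score blocks follow the same pattern, so they can be folded into a loop. The sketch below shows that pattern on synthetic data from `make_classification` rather than the Titanic features, so the accuracies it prints are not comparable to the ones above; the `classifiers` and `results` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

classifiers = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=11),
    'RandomForestClassifier': RandomForestClassifier(random_state=11),
    'LogisticRegression': LogisticRegression(solver='liblinear'),
}

# Fit each model, predict on the held-out part, and record the accuracy
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print('{0} accuracy: {1:.4f}'.format(name, results[name]))
```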

Cross-Validation

Cross-Validation with the KFold Class

In [16]:

from sklearn.model_selection import KFold

def exec_kfold(clf, folds=5):
    # Create a KFold object with `folds` splits and a list to hold the accuracy of each fold
    kfold = KFold(n_splits=folds)
    scores = []

    # Run KFold cross-validation
    for iter_count, (train_index, test_index) in enumerate(kfold.split(X_titanic_df)):
        # Use the indices generated for this fold to select the training and validation rows of X_titanic_df
        X_train, X_test = X_titanic_df.values[train_index], X_titanic_df.values[test_index]
        y_train, y_test = y_titanic_df.values[train_index], y_titanic_df.values[test_index]
        # Train the classifier, predict, and compute the accuracy
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        scores.append(accuracy)
        print('Cross-validation {0} accuracy: {1:.4f}'.format(iter_count, accuracy))

    # Compute the mean accuracy across the folds
    mean_score = np.mean(scores)
    print('Mean accuracy: {0:.4f}'.format(mean_score))

# Call exec_kfold
exec_kfold(dt_clf, folds=5)

Cross-validation 0 accuracy: 0.7542
Cross-validation 1 accuracy: 0.7809
Cross-validation 2 accuracy: 0.7865
Cross-validation 3 accuracy: 0.7697
Cross-validation 4 accuracy: 0.8202
Mean accuracy: 0.7823

Cross-Validation with the cross_val_score() API

In [17]:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt_clf, X_titanic_df, y_titanic_df, cv=5)
for iter_count, accuracy in enumerate(scores):
    print('Cross-validation {0} accuracy: {1:.4f}'.format(iter_count, accuracy))

print('Mean accuracy: {0:.4f}'.format(np.mean(scores)))

Cross-validation 0 accuracy: 0.7430
Cross-validation 1 accuracy: 0.7753
Cross-validation 2 accuracy: 0.7921
Cross-validation 3 accuracy: 0.7865
Cross-validation 4 accuracy: 0.8427
Mean accuracy: 0.7879

The mean accuracy differs slightly between KFold and cross_val_score() because cross_val_score(), when given an integer cv and a classifier, splits the folds with StratifiedKFold.
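The difference between the two splitters is easiest to see on labels that are sorted by class. The sketch below uses 549 zeros followed by 342 ones (the class totals implied by the earlier counts). The sorted order is a deliberately extreme assumption, not the order of the real CSV, and `positive_rates` is an illustrative helper. It prints the fraction of positives in each validation fold for both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels sorted by class: a deliberately extreme, hypothetical ordering
y = np.array([0] * 549 + [1] * 342)
X = np.zeros((len(y), 1))  # placeholder features; only the row count matters

def positive_rates(splitter):
    """Fraction of positive labels in each validation fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

kfold_rates = positive_rates(KFold(n_splits=5))
strat_rates = positive_rates(StratifiedKFold(n_splits=5))

print('KFold          :', [round(r, 3) for r in kfold_rates])
print('StratifiedKFold:', [round(r, 3) for r in strat_rates])
```

With unshuffled `KFold` the folds are contiguous blocks, so their class ratios follow the row order, while `StratifiedKFold` keeps each fold close to the overall ratio.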

Finding the Best Hyperparameters with GridSearchCV

In [18]:

from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5], 'min_samples_leaf': [1, 5, 8]}
grid_dclf = GridSearchCV(dt_clf, param_grid=parameters, scoring='accuracy', cv=5)
grid_dclf.fit(X_train, y_train)

print('GridSearchCV best hyperparameters: ', grid_dclf.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dclf.best_score_))
best_dclf = grid_dclf.best_estimator_

Out[18]:

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=11),
             param_grid={'max_depth': [2, 3, 5, 10],
                         'min_samples_leaf': [1, 5, 8],
                         'min_samples_split': [2, 3, 5]},
             scoring='accuracy')

GridSearchCV best hyperparameters:  {'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.7992

In [19]:

# Predict and evaluate with the estimator that GridSearchCV refit using the best hyperparameters
dpredictions = best_dclf.predict(X_test)
accuracy = accuracy_score(y_test, dpredictions)
print('DecisionTreeClassifier accuracy on the test set: {0:.4f}'.format(accuracy))

DecisionTreeClassifier accuracy on the test set: 0.8715

After training DecisionTreeClassifier with the tuned hyperparameters, the test-set accuracy rises from 0.7877 to 0.8715.
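As a rough sense of how much work the grid search does, the sketch below counts the candidate combinations in the same `parameters` grid with `ParameterGrid` and multiplies by the 5 folds. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above
parameters = {'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5], 'min_samples_leaf': [1, 5, 8]}

n_candidates = len(ParameterGrid(parameters))  # 4 * 3 * 3 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the full training set, which is the estimator exposed as `best_estimator_`.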
