Basic process of modeling and evaluation:

# 0. Feature Engineering

Import data:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # Display minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)      # Set the output figure size

# Read the training set
train = pd.read_csv('train.csv')
```

After reading the data set, some processing is needed to prepare the data for the model building and training that follow.

## 1. Filling missing values: `.fillna()`

- Continuous variables: fill with the mean, median, or mode
- Categorical variables: fill with a placeholder category such as NA, or with the most frequent category

```python
# Check the number of missing values per column
train.isnull().sum().sort_values(ascending=False)
```

```
Embarked       0
Cabin          0
Fare           0
Ticket         0
Parch          0
SibSp          0
Age            0
Sex            0
Name           0
Pclass         0
Survived       0
PassengerId    0
dtype: int64
```
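The counts above are all zero, so the missing values have evidently already been filled. A minimal sketch of how such filling could look, on a hypothetical mini-frame (the column names mirror the Titanic data; the values are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame standing in for two Titanic columns
df = pd.DataFrame({
    'Age':      [22.0, np.nan, 38.0, np.nan, 35.0],
    'Embarked': ['S', 'C', None, 'S', 'S'],
})

# Continuous variable: fill with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Categorical variable: fill with the most frequent category (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df.isnull().sum().sum())  # 0 missing values remain
```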

## 2. Encoding categorical variables: `pandas.get_dummies()`

`get_dummies` converts each categorical column into indicator (dummy) columns, one per category — for example, a variable with only two possible categories becomes two 0/1 columns.

```python
# Extract all input features
data = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
# Dummy-variable conversion
data = pd.get_dummies(data)
```
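To see what `get_dummies` does, a small self-contained sketch (the `demo` frame is made up for illustration): `Sex` has two categories, so it becomes two indicator columns, while numeric columns pass through unchanged.

```python
import pandas as pd

# Made-up mini-frame: one categorical column, one numeric column
demo = pd.DataFrame({'Sex': ['male', 'female', 'female'], 'Age': [22, 38, 26]})
encoded = pd.get_dummies(demo)

# Numeric columns are kept; 'Sex' is replaced by one indicator column per category
print(list(encoded.columns))  # ['Age', 'Sex_female', 'Sex_male']
```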

# 1. Model Building

After processing the data, the next step is to build a model; before modeling, an appropriate model must be chosen.

- First determine the learning type: supervised or unsupervised
- Selection criteria: the task, the sample size, and the sparsity of the features
- Procedure: first fit a simple model as a baseline, then compare it with other models, and finally pick the model with the better generalization ability or performance.

## 1.1 Splitting the training and test sets

- Objective: make it possible to evaluate the model's generalization ability later
- Splitting options:
  - Split off a fixed proportion: the test set is commonly 30%, 25%, 15%, or 10% of the data
  - Stratify the split by the target variable
  - Set a random seed so the results are reproducible
- sklearn's function for splitting data: `train_test_split()`

Random shuffling can be skipped when splitting only if the data set has already been shuffled, or the sample size is large enough.

```python
from sklearn.model_selection import train_test_split

# Usually X and y are extracted before splitting; they can also be used unsplit
X = data
y = train['Survived']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Check the training and test set sizes
X_train.shape, X_test.shape  # ((668, 10), (223, 10))
```
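A quick sketch of why `stratify=y` matters: on a synthetic imbalanced label vector (made up for illustration, not the Titanic data), stratification keeps the positive-class ratio the same in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 negatives, 20 positives (20% positive rate)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Stratification preserves the 20% positive rate in both splits
print(ytr.mean(), yte.mean())
```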

## 1.2 Model creation

### Model categories

- Linear classification models (`sklearn.linear_model`): logistic regression (logistic regression is a classification model; linear regression is a regression model)
- Tree-based classification models (`sklearn.ensemble`): decision tree and random forest (a random forest is an ensemble of decision trees that reduces a single tree's overfitting)

For why a linear model can be used for binary classification, see: Machine learning notes - Classification using linear models.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Check the training and test set scores
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

# Logistic regression model after tuning parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
```

The score on the test set increased after adjusting the parameters.

```python
# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

# Random forest classification model with tuned parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
```
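Why capping `max_depth` helps can be sketched on synthetic data (a made-up example, not the Titanic data): the unconstrained forest nearly memorizes its noisy training set, while limiting the depth narrows the gap between training and test scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data: 20% of the labels are flipped, so memorizing hurts
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

deep = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(Xtr, ytr)

# The unconstrained forest scores near-perfectly on training data...
print(deep.score(Xtr, ytr))
# ...while the depth-limited forest has a smaller train/test gap
print(shallow.score(Xtr, ytr) - shallow.score(Xte, yte))
```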

## 1.3 Outputting model predictions

For supervised models in sklearn, `predict` outputs the predicted labels, and `predict_proba` outputs the predicted probability of each label.

```python
# Predicted labels
pred = lr.predict(X_train)
# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)
```
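The relationship between the two outputs can be sketched on synthetic data (an illustrative example, not the Titanic data): each row of `predict_proba` sums to 1, and `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)

labels = clf.predict(X)       # hard labels, shape (100,)
proba = clf.predict_proba(X)  # one probability per class, shape (100, 2)

# Every row of probabilities sums to 1
print(np.allclose(proba.sum(axis=1), 1.0))
# predict picks the class with the highest probability
print((clf.classes_[proba.argmax(axis=1)] == labels).all())
```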

# 2. Model Evaluation

- Objective: estimate the model's generalization ability
- Method: cross-validation
  - The data is split multiple times, and multiple models must be trained
  - The most commonly used form is k-fold cross-validation, where k is a user-specified number, usually 5 or 10
  - Reference: Li Hongyi's machine learning, Task03 (error and gradient descent), Section 3 on mean squared error

- Precision measures how many of the samples predicted as positive are truly positive
- Recall measures how many of the truly positive samples are predicted as positive (TP)
- The F-score is the harmonic mean of precision and recall
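These definitions can be checked by hand on a tiny made-up label vector and compared against sklearn's metric functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 3 true positives in y_true; the model catches 2 of them
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# By hand: TP = 2, FP = 1, FN = 1
p = 2 / (2 + 1)            # precision = TP / (TP + FP)
r = 2 / (2 + 1)            # recall    = TP / (TP + FN)
f1 = 2 * p * r / (p + r)   # harmonic mean of precision and recall

# sklearn agrees with the hand computation
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```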

## 2.1 Cross-validation

Module in sklearn: `sklearn.model_selection`

```python
from sklearn.model_selection import cross_val_score

# Evaluate the logistic regression model with 10-fold cross-validation
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

# k-fold cross-validation scores
scores
# Average cross-validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))

The larger k is, the longer it takes, but averaging the per-fold errors gives a more reliable estimate of the generalization error.
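What `cross_val_score` does under the hood can be reproduced manually with `KFold` (a sketch on synthetic data; passing the same `KFold` object to both makes the splits identical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
kf = KFold(n_splits=5)

# Manual k-fold: train on k-1 folds, score on the held-out fold, repeat
manual = []
clf = LogisticRegression()
for train_idx, test_idx in kf.split(X):
    clf.fit(X[train_idx], y[train_idx])
    manual.append(clf.score(X[test_idx], y[test_idx]))

# cross_val_score with the same splitter produces the same five scores
auto = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(np.allclose(manual, auto))
```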

## 2.2 Confusion matrix

For binary classification problems, the commonly used evaluation metrics are precision and recall; the overall metric for a classifier is classification accuracy.

A classifier's prediction on the test set is either correct or incorrect, giving four cases:

- TP (true positive): a positive sample predicted as positive
- FN (false negative): a positive sample predicted as negative
- FP (false positive): a negative sample predicted as positive
- TN (true negative): a negative sample predicted as negative

- Module in sklearn: `sklearn.metrics`
- The confusion matrix takes the true labels and the predicted labels as input

```python
from sklearn.metrics import confusion_matrix

# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
```

```python
# Model predictions
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)

from sklearn.metrics import classification_report
# Precision, recall, and F1 score
print(classification_report(y_train, pred))
```
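sklearn lays the four cases out with true classes as rows and predicted classes as columns; a tiny made-up example makes the layout concrete:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: TN = 4, FP = 1, FN = 1, TP = 2
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# sklearn's layout (binary case):
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[4 1]
           #  [1 2]]
```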

## 2.3 ROC curve

- The ROC curve module in sklearn is `sklearn.metrics`
- The larger the area under the ROC curve, the better

```python
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")

# Find the threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
```
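"The larger the area, the better" can be quantified with `roc_auc_score`, which computes the area under the ROC curve directly; a minimal made-up example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUC: the probability that a random positive is scored above a random negative.
# Here 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
print(roc_auc_score(y_true, scores))  # 0.75
```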