Decision Tree Classification for Breast Cancer Analysis


In this article, you will learn to train scikit-learn decision tree models, a DecisionTreeClassifier and a DecisionTreeRegressor, on a life expectancy dataset.

From there we'll create a Python script to analyse the breast-cancer-wisconsin.csv dataset and split it into two sets: a training set and a testing set.

We perform decision tree classification and regression, walk through their main operations, and then analyse the Breast Cancer dataset and prepare it for classification.

In [1]:

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

# Load the life expectancy dataset
df = pd.read_csv("/home/webtunix/Life_expectancy.csv")

Perform decision tree classification to classify the life expectancy data

In [2]:

from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing

# Encode the non-numeric columns as integer labels
lab_enc = preprocessing.LabelEncoder()
df["Life expectancy"] = lab_enc.fit_transform(df["Life expectancy"])
df["Entity"] = lab_enc.fit_transform(df["Entity"])

dt_clf = DecisionTreeClassifier(random_state=0)
dt_clf.fit(df[["Entity", "Year"]], df["Life expectancy"])

Out[2]:

DecisionTreeClassifier(random_state=0)

Return the index of the leaf that each sample is classified as.

In [3]:

dt_clf.apply(df[["Entity","Year"]], check_input=True)

Out[3]:

array([   4,    4,    4, ..., 4984, 4985, 4986])

Return the decision path in the tree

In [4]:

dt_clf.decision_path(df[["Entity","Year"]], check_input=True)

Out[4]:

<3253x4987 sparse matrix of type '<class 'numpy.int64'>'
	with 69595 stored elements in Compressed Sparse Row format>
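The result is a sparse indicator matrix with one row per sample and one column per tree node; entry (i, j) is 1 when sample i passes through node j. A minimal sketch of how you might read off the path of a single sample, assuming the dt_clf and df objects defined above:

node_indicator = dt_clf.decision_path(df[["Entity", "Year"]])
sample_id = 0
# The CSR representation stores, per row, the indices of the visited nodes
visited_nodes = node_indicator.indices[
    node_indicator.indptr[sample_id]:node_indicator.indptr[sample_id + 1]]
print("Nodes visited by sample 0:", visited_nodes)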

In [5]:

print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.24.2.

Get the parameters of this classifier

In [6]:

dt_clf.get_params()

Out[6]:

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 0,
 'splitter': 'best'}

Return the mean accuracy on the given data and labels (here the model is scored on its own training data)

In [7]:

dt_clf.score(df[["Entity","Year"]],df["Life expectancy"], sample_weight=None)

Out[7]:

1.0
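A perfect score here mostly reflects that the tree was scored on the same data it was fitted on. A minimal sketch of a more honest check on a held-out split, assuming the same df as above (the split ratio and random_state are illustrative choices, not values from the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test portion so the score measures generalisation, not memorisation
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["Entity", "Year"]], df["Life expectancy"], test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Held-out accuracy:", clf.score(X_te, y_te))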

Predict the class for each sample in the dataset

In [8]:

dt_clf.predict(df[["Entity","Year"]])

Out[8]:

array([ 348,  348,  348, ..., 2052, 2050, 2049])

Perform decision tree regression on the life expectancy dataset

In [9]:

from sklearn.tree import DecisionTreeRegressor

# Encode the target and the Year feature as integer labels
lab_enc = preprocessing.LabelEncoder()
expectancy = lab_enc.fit_transform(df["Life expectancy"])
year = lab_enc.fit_transform(df["Year"])

dt_rgs = DecisionTreeRegressor(random_state=0)
dt_rgs.fit(year.reshape(-1, 1), expectancy)

Out[9]:

DecisionTreeRegressor(random_state=0)

Predict the regression values

In [10]:

dt_rgs.predict(year.reshape(-1,1))

Out[10]:

array([ 334.4       ,  316.2       ,  330.8       , ..., 2033.53333333,
       2037.13333333, 2041.46666667])

Return the index of the leaf that each sample is predicted as

In [11]:

dt_rgs.apply(year.reshape(-1,1))

Out[11]:

array([ 11,  13,  14, ..., 429, 431, 432])

In [12]:

dt_rgs.decision_path(year.reshape(-1,1))

Out[12]:

<3253x433 sparse matrix of type '<class 'numpy.int64'>'
	with 31915 stored elements in Compressed Sparse Row format>

Get the parameters of the regression tree

In [13]:

dt_rgs.get_params(deep=True)

Out[13]:

{'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 0,
 'splitter': 'best'}

Return the coefficient of determination (R²) of the prediction

In [14]:

dt_rgs.score(year.reshape(-1,1),expectancy)

Out[14]:

0.7880874805337345
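For a regressor, score returns R² rather than accuracy; it is the same number sklearn.metrics.r2_score would give. A minimal sketch, assuming the fitted dt_rgs and the year and expectancy arrays from above:

from sklearn.metrics import r2_score

# R² compares the model's errors against a constant "predict the mean" baseline
preds = dt_rgs.predict(year.reshape(-1, 1))
print("R2 via r2_score:", r2_score(expectancy, preds))
print("R2 via score() :", dt_rgs.score(year.reshape(-1, 1), expectancy))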

In [15]:

arr=[2045]
dt_rgs.predict(np.array(arr).reshape(-1,1))

Out[15]:

array([2041.46666667])

Load the built-in breast cancer dataset from sklearn

In [16]:

from sklearn.datasets import load_breast_cancer

# Load the built-in Wisconsin breast cancer dataset and check a few target labels
data = load_breast_cancer()
data.target[[10, 50, 85]]

Out[16]:

array([0, 1, 0])

In [17]:

list(data.target_names)

Out[17]:

['malignant', 'benign']
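Before moving on to the CSV file, here is a minimal sketch of fitting a decision tree classifier to this built-in breast cancer dataset; the split parameters below are illustrative assumptions, not values from the original notebook:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Train on one portion of the built-in data and score on the held-out portion
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)
cancer_clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Test accuracy:", cancer_clf.score(X_te, y_te))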

Perform Data analysis on breast-cancer-wisconsin.csv

In [106]:

dataset = pd.read_csv('/home/webtunix/Downloads/breast-cancer-wisconsin.csv')

# Split the columns into two sub-arrays and preview the first column of X
X = dataset.iloc[:, :6].values
Y = dataset.iloc[:, 6:11].values
print(X[:30, :1])
[[1000025]
 [1002945]
 [1015425]
 [1016277]
 [1017023]
 [1017122]
 [1018099]
 [1018561]
 [1033078]
 [1033078]
 [1035283]
 [1036172]
 [1041801]
 [1043999]
 [1044572]
 [1047630]
 [1048672]
 [1049815]
 [1050670]
 [1050718]
 [1054590]
 [1054593]
 [1056784]
 [1057013]
 [1059552]
 [1065726]
 [1066373]
 [1066979]
 [1067444]
 [1070935]]

Print the second sub-array (Y) of the dataset.

In [107]:

print(Y)
[['1' 3 1 1 2]
 ['10' 3 2 1 2]
 ['2' 3 1 1 2]
 ...
 ['3' 8 10 2 4]
 ['4' 10 6 1 4]
 ['5' 10 4 1 4]]

Display the first 10 rows of the dataset

In [127]:

dataset.head(10)

Out[127]:

  id number Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bare Nuclei Bland Chromatin Normal Nucleoli Mitoses Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
5 1017122 8 10 10 8 7 10 9 7 1 4
6 1018099 1 1 1 1 2 10 3 1 1 2
7 1018561 2 1 2 1 2 1 3 1 1 2
8 1033078 2 1 1 1 2 1 1 1 5 2
9 1033078 4 2 1 1 2 1 2 1 1 2


Print the dataset dimensions/shape

In [109]:

print("Cancer data set dimensions : {}".format(dataset.shape))
Cancer data set dimensions : (699, 11)

Check whether any null values are present in the dataset

In [110]:

dataset.isnull().sum()
dataset.isna().sum()

Out[110]:

id number                      0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
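The zero counts can be misleading: in the original UCI version of this file, missing Bare Nuclei values are encoded as the string '?', which isnull() does not detect (notice that Bare Nuclei is also absent from the describe() summary below because the column is read as text). A minimal sketch of how to surface such entries, assuming '?' really is the missing-value marker in your copy of the CSV:

# Treat '?' as NaN so pandas can count the genuinely missing entries
dataset_clean = dataset.replace('?', np.nan)
print(dataset_clean.isnull().sum())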

Describe the dataset

In [111]:

dataset.describe()

Out[111]:

  id number Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bland Chromatin Normal Nucleoli Mitoses Class
count 6.990000e+02 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000 699.000000
mean 1.071704e+06 4.417740 3.134478 3.207439 2.806867 3.216023 3.437768 2.866953 1.589413 2.689557
std 6.170957e+05 2.815741 3.051459 2.971913 2.855379 2.214300 2.438364 3.053634 1.715078 0.951273
min 6.163400e+04 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000
25% 8.706885e+05 2.000000 1.000000 1.000000 1.000000 2.000000 2.000000 1.000000 1.000000 2.000000
50% 1.171710e+06 4.000000 1.000000 1.000000 1.000000 2.000000 3.000000 1.000000 1.000000 2.000000
75% 1.238298e+06 6.000000 5.000000 5.000000 4.000000 4.000000 5.000000 4.000000 1.000000 4.000000
max 1.345435e+07 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 4.000000

Print the values of the feature sub-dataset (m) and its shape

In [121]:

# Drop the Class column to obtain the feature matrix m
m = dataset.drop(['Class'], axis=1, inplace=False)
print('m Data is \n', m.head())
print('m shape is ', m.shape)
m Data is 
    id number  Clump Thickness  Uniformity of Cell Size   \
0    1000025                5                         1   
1    1002945                5                         4   
2    1015425                3                         1   
3    1016277                6                         8   
4    1017023                4                         1   

   Uniformity of Cell Shape  Marginal Adhesion   Single Epithelial Cell Size  \
0                         1                   1                            2   
1                         4                   5                            7   
2                         1                   1                            2   
3                         8                   1                            3   
4                         1                   3                            2   

  Bare Nuclei  Bland Chromatin   Normal Nucleoli    Mitoses  
0           1                 3                  1        1  
1          10                 3                  2        1  
2           2                 3                  1        1  
3           4                 3                  7        1  
4           1                 3                  1        1  
m shape is  (699, 10)

In [122]:

n = dataset['Class']
print('n Data is \n' , n.head())
print('n shape is ' , n.shape)
n Data is 
 0    2
1    2
2    2
3    2
4    2
Name: Class, dtype: int64
n shape is  (699,)

Use SimpleImputer, an imputation transformer, to complete missing values

In [123]:

from sklearn.impute import SimpleImputer

# Replace NaN entries with the most frequent value in each column
ImputedModule = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
ImputedX = ImputedModule.fit(m)
m = ImputedX.transform(m)
print('m Data is \n', m[:570])
print('\n n Data is \n', n[:570])
m Data is 
 [[1000025 5 1 ... 3 1 1]
 [1002945 5 4 ... 3 2 1]
 [1015425 3 1 ... 3 1 1]
 ...
 [1334071 4 1 ... 2 1 1]
 [1343068 8 4 ... 2 5 2]
 [1343374 10 10 ... 10 3 1]]

 n Data is 
 0      2
1      2
2      2
3      2
4      2
      ..
565    4
566    2
567    2
568    4
569    4
Name: Class, Length: 570, dtype: int64

Split the data into training and test subsets and print their shapes

In [124]:

from sklearn.model_selection import train_test_split
m_train, m_test, n_train, n_test = train_test_split(m, n, test_size=0.33, random_state=44, shuffle=True)

In [125]:

print('m_train shape is ' , m_train.shape)
print('m_test shape is ' , m_test.shape)
print('n_train shape is ' , n_train.shape)
print('n_test shape is ' , n_test.shape)
m_train shape is  (468, 10)
m_test shape is  (231, 10)
n_train shape is  (468,)
n_test shape is  (231,)
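With the split in place, here is a minimal sketch of the step this section builds towards: fitting a decision tree classifier on the training portion and scoring it on the held-out portion. It assumes the m_train/m_test/n_train/n_test arrays above and that any non-numeric markers such as '?' have already been dealt with:

from sklearn.tree import DecisionTreeClassifier

# The tree needs numeric features, so coerce the imputed arrays to float
m_train_num = m_train.astype(float)
m_test_num = m_test.astype(float)

cancer_tree = DecisionTreeClassifier(random_state=0)
cancer_tree.fit(m_train_num, n_train)
print("Test accuracy:", cancer_tree.score(m_test_num, n_test))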

Conclusion

In this article we performed decision tree classification and regression, explored how the decision tree assigns data to leaves and decision paths, and then analysed the Breast Cancer dataset, preparing it with imputation and a train/test split.

