Random Forest [R Mining]


YONG_X 2013. 6. 4. 00:34

# 전용준 :: 리비젼컨설팅 :: 02-415-7650 :: revision.co.kr :: xyxonxyxon@empal.com

#==================================================

Random Forest in R

Random forest is an ensemble classifier built from many decision trees.

Wikipedia has a good explanation of the learning algorithm.

The key idea is bagging: m models are fit, each on n' samples drawn with replacement from a data set of size n.
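The bagging step can be sketched in base R. This is only an illustration, not how randomForest is implemented internally; the seed, the value `m <- 5`, and the names `boot_idx` and `oob_frac` are assumptions made for the example. It draws n' = n row indices with replacement for each model and reports the fraction of rows that each sample leaves out.

```r
# Bagging sketch: one bootstrap sample of row indices per model.
set.seed(42)          # arbitrary seed, only for reproducibility
n <- nrow(iris)       # n = 150 observations
m <- 5                # number of models; hypothetical small value

boot_idx <- lapply(seq_len(m), function(i) sample(n, size = n, replace = TRUE))

# Rows never drawn for a given model are out-of-bag (OOB) for that model.
oob_frac <- sapply(boot_idx, function(idx) mean(!(seq_len(n) %in% idx)))
print(round(oob_frac, 2))
```

Roughly a third of the rows end up out-of-bag for each model (about 1/e, or 0.37, in expectation for large n). These left-out rows are what an OOB error estimate is computed on.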

In addition, when choosing the split criterion at each tree node, the splitting feature is chosen from a random subset of the features (not from all of them).
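A minimal base R sketch of the feature subsampling at one node, assuming the subset size is floor(sqrt(p)), which is the randomForest default (`mtry`) for classification. The names `features`, `mtry` and `candidates` and the seed are chosen for illustration only.

```r
# Feature-subsampling sketch for a single node split.
features <- setdiff(names(iris), "Species")   # the four measurement columns
p <- length(features)                         # p = 4
mtry <- floor(sqrt(p))                        # 2 candidate features per split

set.seed(1)                                   # arbitrary seed
candidates <- sample(features, mtry)          # drawn without replacement
print(candidates)
```

Only the features in `candidates` would be evaluated for this split, and a fresh subset is drawn at every node. With the four iris features this gives 2, which matches the "No. of variables tried at each split: 2" line in the print(r) output further down.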

Each decision tree is fully grown (no pruning).

When making a prediction, the mode of the predictions of all trees (for classification) or their average (for regression) is used.

R has the randomForest package for this.
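The two aggregation rules can be illustrated with a few made-up per-tree predictions. The vectors `votes` and `tree_preds` below are hypothetical values, not output from a fitted forest.

```r
# Classification: the most frequent class among the trees' votes wins.
votes <- c("versicolor", "virginica", "versicolor", "versicolor", "virginica")
majority <- names(which.max(table(votes)))
print(majority)          # "versicolor" (3 votes against 2)

# Regression: the trees' numeric predictions are averaged.
tree_preds <- c(2.0, 2.5, 3.0)
print(mean(tree_preds))  # 2.5
```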

Here, we'll use the built-in iris data set.

> head(iris) 

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species 

1          5.1         3.5          1.4         0.2  setosa 

2          4.9         3.0          1.4         0.2  setosa 

3          4.7         3.2          1.3         0.2  setosa 

4          4.6         3.1          1.5         0.2  setosa 

5          5.0         3.6          1.4         0.2  setosa 

6          5.4         3.9          1.7         0.4  setosa

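Prepare the training and test sets. The first 10 rows of each species go into the test set and the remaining 40 rows of each species go into the training set.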

> test = iris[ c(1:10, 51:60, 101:110), ]

> train = iris[ c(11:50, 61:100, 111:150), ]
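The split above always takes the first 10 rows of each species. As an alternative sketch (not what the original post does), the test rows could be drawn at random within each species. The names `test_random` and `train_random` and the seed are assumptions for this example.

```r
# Random stratified split: 10 test rows per species, the rest for training.
set.seed(123)   # arbitrary seed, only for reproducibility
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
test_idx <- unlist(lapply(rows_by_species, function(rows) sample(rows, 10)))

test_random  <- iris[test_idx, ]
train_random <- iris[-test_idx, ]
print(table(test_random$Species))
```

This keeps the same 30/120 sizes and the same class balance as the fixed split, but no longer depends on the row order of the data.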

Build a random forest using randomForest() from the randomForest package.

> library(randomForest)

> r = randomForest(Species ~., data=train, importance=TRUE, do.trace=100) 

ntree      OOB      1      2      3 

  100:   5.83%  0.00%  7.50% 10.00% 

  200:   5.83%  0.00% 10.00%  7.50% 

  300:   5.83%  0.00% 10.00%  7.50% 

  400:   5.00%  0.00% 10.00%  5.00% 

  500:   5.83%  0.00% 10.00%  7.50% 

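*** Note: the call looks much the same as fitting an ordinary decision tree; the only part that differs here is the do.trace setting, which prints the running error every 100 trees.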
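The call sets importance=TRUE, but the post never looks at the result. A short sketch of how the importance measures could be inspected, assuming the randomForest package is installed. The seed is arbitrary and the exact numbers vary between runs, so none are shown.

```r
library(randomForest)   # assumes the package is installed

set.seed(2013)          # arbitrary seed; results vary slightly between runs
train <- iris[c(11:50, 61:100, 111:150), ]
r <- randomForest(Species ~ ., data = train, importance = TRUE)

imp <- importance(r)    # matrix with one row per feature
print(round(imp, 2))
```

For a classification forest fit with importance=TRUE, the matrix has one column per class plus MeanDecreaseAccuracy and MeanDecreaseGini. varImpPlot(r) draws the same measures as a dot chart.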
  
> print(r) 

Call: 

 randomForest(formula = Species ~ ., data = train, importance = TRUE,      do.trace = 100)  

               Type of random forest: classification 

                     Number of trees: 500 

No. of variables tried at each split: 2 

        OOB estimate of  error rate: 5.83% 

Confusion matrix: 

           setosa versicolor virginica class.error 

setosa         40          0         0       0.000 

versicolor      0         36         4       0.100 

virginica       0          3        37       0.075
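The class.error column and the OOB error rate follow directly from the confusion matrix counts. The sketch below retypes the counts printed above into a matrix; the names `conf`, `class_error` and `oob_error` are chosen for illustration.

```r
# Confusion matrix counts copied from the print(r) output (rows = observed).
conf <- matrix(c(40,  0,  0,
                  0, 36,  4,
                  0,  3, 37),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed  = c("setosa", "versicolor", "virginica"),
                               predicted = c("setosa", "versicolor", "virginica")))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(conf) / rowSums(conf)
# Overall error: share of all 120 training rows that were misclassified.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(class_error)                 # 0.000 0.100 0.075
print(round(100 * oob_error, 2))   # 5.83
```

Seven of the 120 training rows are misclassified (4 versicolor and 3 virginica), which gives the reported 5.83%.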

We see that setosa is predicted perfectly (class error 0), while versicolor and virginica are occasionally confused with each other. Let's see how the model does on the test set.

> iris.predict = predict(r, test) 

> iris.predict 

         1          2          3          4          5          6          7  

    setosa     setosa     setosa     setosa     setosa     setosa     setosa  

         8          9         10         51         52         53         54  

    setosa     setosa     setosa versicolor versicolor versicolor versicolor  

        55         56         57         58         59         60        101  

versicolor versicolor versicolor versicolor versicolor versicolor  virginica  

       102        103        104        105        106        107        108  

 virginica  virginica  virginica  virginica  virginica versicolor  virginica  

       109        110  

 virginica  virginica  

Levels: setosa versicolor virginica 

> t = table(observed=test[,'Species'], predict=iris.predict) 

> t 

            predict 

observed     setosa versicolor virginica 

  setosa         10          0         0 

  versicolor      0         10         0 

  virginica       0          1         9 

> prop.table(t, 1) 

            predict 

observed     setosa versicolor virginica 

  setosa        1.0        0.0       0.0 

  versicolor    0.0        1.0       0.0 

  virginica     0.0        0.1       0.9 


As you can see, 10% of the virginica test samples (1 of 10) were predicted as versicolor.

References

Package ‘randomForest’: Breiman and Cutler’s random forests for classification and regression. Reference manual.
Andy Liaw and Matthew Wiener, Classification and Regression by randomForest, R News 2(3), pp. 18-22, 2002. Good examples on classification, regression and clustering (really!), and some practical advice.

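*** Note: This post was copied from http://mkseo.pe.kr/stats/?p=220 and I only translated it. Random forests are very popular in data science, so I hope this small effort helps more people use them comfortably. I trust the original author will excuse my copying the post in order to translate it.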

In the next post, I am thinking of building an example that applies this to a KBO data set.

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)
