
[Big Data] Prediction and Classification Using Data

allempty_sheep 2024. 6. 18. 16:48
🎁 This post records the process of following along with and running the examples from the book '실무로 배우는 빅데이터 기술' (Big Data Technology Learned Through Practice).

🎁 The book makes it easy to learn the overall flow and process of big data handling, and I recommend it to anyone interested in big data.

 

◼ Run the query below in the Hive query editor to prepare the dataset. Each component gets a penalty score, and a car whose component scores sum to less than 6 is labeled '비정상' (abnormal), otherwise '정상' (normal).

insert overwrite local directory '/home/pilot-pjt/spark-data/classification/input'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select 
  sex, age, marriage, region, job, car_capacity, car_year, car_model,
  tire_fl, tire_fr, tire_bl, tire_br, light_fl, light_fr, light_bl, light_br,
  engine, break, battery,
  case when ((tire_fl_s  + tire_fr_s  + tire_bl_s  + tire_br_s  + 
              light_fl_s + light_fr_s + light_bl_s + light_br_s + 
              engine_s   + break_s    + battery_s  + 
              car_capacity_s + car_year_s + car_model_s) < 6) 
       then '비정상' else '정상' 
  end as status
from (
  select 
    sex, age, marriage, region, job, car_capacity, car_year, car_model,
    tire_fl, tire_fr, tire_bl, tire_br, light_fl, light_fr, light_bl, light_br,
    engine, break, battery,

    case 
       when (1500 > cast(car_capacity as int)) then -0.3 
       when (2000 > cast(car_capacity as int)) then -0.2 
       else -0.1
    end as car_capacity_s ,

    case 
       when (2005 > cast(car_year as int)) then -0.3 
       when (2010 > cast(car_year as int)) then -0.2 
       else -0.1
    end as car_year_s ,

    case 
       when ('B' = car_model) then -0.3
       when ('D' = car_model) then -0.3 
       when ('F' = car_model) then -0.3 
       when ('H' = car_model) then -0.3 
       else 0.0
    end as car_model_s ,

    case 
       when (10 > cast(tire_fl as int)) then 0.1 
       when (20 > cast(tire_fl as int)) then 0.2 
       when (40 > cast(tire_fl as int)) then 0.4 
       else 0.5
    end as tire_fl_s ,

    case 
       when (10 > cast(tire_fr as int)) then 0.1 
       when (20 > cast(tire_fr as int)) then 0.2 
       when (40 > cast(tire_fr as int)) then 0.4 
       else 0.5
    end as tire_fr_s ,

    case 
       when (10 > cast(tire_bl as int)) then 0.1 
       when (20 > cast(tire_bl as int)) then 0.2 
       when (40 > cast(tire_bl as int)) then 0.4 
       else 0.5
    end as tire_bl_s ,

    case 
       when (10 > cast(tire_br as int)) then 0.1 
       when (20 > cast(tire_br as int)) then 0.2 
       when (40 > cast(tire_br as int)) then 0.4 
       else 0.5
    end as tire_br_s ,

    case when (cast(light_fl as int) = 2) then 0.0 else 0.5 end as light_fl_s ,
    case when (cast(light_fr as int) = 2) then 0.0 else 0.5 end as light_fr_s , 
    case when (cast(light_bl as int) = 2) then 0.0 else 0.5 end as light_bl_s ,
    case when (cast(light_br as int) = 2) then 0.0 else 0.5 end as light_br_s , 

    case 
       when (engine = 'A') then 1.0 
       when (engine = 'B') then 0.5 
       when (engine = 'C') then 0.0
    end as engine_s ,

    case 
       when (break = 'A') then 1.0 
       when (break = 'B') then 0.5 
       when (break = 'C') then 0.0
    end as break_s ,

    case 
       when (20 > cast(battery as int)) then 0.2 
       when (40 > cast(battery as int)) then 0.4 
       when (60 > cast(battery as int)) then 0.6 
       else 1.0
    end as battery_s 

  from managed_smartcar_status_info ) T1

 

◼ Check the generated files at the path below (run this on server02).

server02

more /home/pilot-pjt/spark-data/classification/input/*

 

◼ You can list the files at this location.

cd /home/pilot-pjt/spark-data/classification/input
ls -al

 

◼ Hive produced two output files (000000_0, 000001_0); merge them into a single file.

cat 000000_0 000001_0 > classification_dataset.txt

 

You can now view the merged data file.

 

◼ Create a directory on HDFS for the dataset.

hdfs dfs -mkdir -p /pilot-pjt/spark-data/classification/input

 

◼ Put the merged file into the HDFS directory you just created.

hdfs dfs -put /home/pilot-pjt/spark-data/classification/input/classification_dataset.txt /pilot-pjt/spark-data/classification/input

 

◼ Restart Zeppelin.

zeppelin-daemon.sh restart

 

◼ Create a new note in Zeppelin and run the code below.

//Figure 7.62 Import the Spark ML libraries------------------------------------------------------

import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.evaluation.MulticlassMetrics 
import org.apache.spark.mllib.util.MLUtils

//Figure 7.63 Load the training data for Spark ML------------------------------------------------------

val ds = spark.read.csv("/pilot-pjt/spark-data/classification/input/classification_dataset.txt")
ds.show(5)
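// The CSV has no header row, so Spark names the columns _c0 .. _c19 and reads them
// all as strings; the selectExpr below picks out the columns we need and casts them.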

//Figure 7.64 Select the columns to use in Spark ML------------------------------------------------------

val dsSmartCar = ds.selectExpr("cast(_c5 as long) car_capacity", 
                        "cast(_c6 as long) car_year",
                        "cast(_c7 as string) car_model",
                        "cast(_c8 as int) tire_fl",
                        "cast(_c9 as long) tire_fr",
                        "cast(_c10 as long) tire_bl",
                        "cast(_c11 as long) tire_br",
                        "cast(_c12 as long) light_fl",
                        "cast(_c13 as long) light_fr",
                        "cast(_c14 as long) light_bl",
                        "cast(_c15 as long) light_br",
                        "cast(_c16 as string) engine",
                        "cast(_c17 as string) break",
                        "cast(_c18 as long) battery",
                        "cast(_c19 as string) status"
                       )


//Figure 7.65 Convert categorical columns into numeric (indexed) columns------------------------------------------------------

val dsSmartCar_1 = new StringIndexer().setInputCol("car_model").setOutputCol("car_model_n").fit(dsSmartCar).transform(dsSmartCar)
val dsSmartCar_2 = new StringIndexer().setInputCol("engine").setOutputCol("engine_n").fit(dsSmartCar_1).transform(dsSmartCar_1)
val dsSmartCar_3 = new StringIndexer().setInputCol("break").setOutputCol("break_n").fit(dsSmartCar_2).transform(dsSmartCar_2)
val dsSmartCar_4 = new StringIndexer().setInputCol("status").setOutputCol("label").fit(dsSmartCar_3).transform(dsSmartCar_3)
val dsSmartCar_5 = dsSmartCar_4.drop("car_model").drop("engine").drop("break").drop("status")

dsSmartCar_5.show()
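// (Optional, not in the book) The four StringIndexer calls above all follow the same
// pattern, so they can equivalently be written as a single fold over the column pairs;
// dsSmartCarIndexed below ends up identical to dsSmartCar_5.
val dsSmartCarIndexed = Seq("car_model" -> "car_model_n",
                            "engine"    -> "engine_n",
                            "break"     -> "break_n",
                            "status"    -> "label")
  .foldLeft(dsSmartCar) { case (df, (inCol, outCol)) =>
    new StringIndexer().setInputCol(inCol).setOutputCol(outCol).fit(df).transform(df)
  }
  .drop("car_model", "engine", "break", "status")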


//Figure 7.66 Build the feature vector for Spark ML------------------------------------------------------

val cols = Array("car_capacity", "car_year", "car_model_n", "tire_fl",
                 "tire_fr", "tire_bl", "tire_br", "light_fl", "light_fr", 
                 "light_bl", "light_br", "engine_n", "break_n", "battery")

val dsSmartCar_6 = new VectorAssembler().setInputCols(cols).setOutputCol("features").transform(dsSmartCar_5)
val dsSmartCar_7 = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures").fit(dsSmartCar_6).transform(dsSmartCar_6)
val dsSmartCar_8 = dsSmartCar_7.drop("features").withColumnRenamed("scaledFeatures", "features")
dsSmartCar_8.show()
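// VectorAssembler packed the 14 input columns into a single "features" vector, and
// MinMaxScaler rescaled each feature to the [0.0, 1.0] range (its default), which is
// where the fractional values in the features column come from.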

//Figure 7.67 Save the training data in LibSVM format------------------------------------------------------

val dsSmartCar_9 = dsSmartCar_8.select("label", "features")
dsSmartCar_9.write.format("libsvm").save("/pilot-pjt/spark-data/classification/smartCarLibSVM")
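// LibSVM is a sparse text format: each line is "<label> <index>:<value> ...", which is
// why the features column later shows up as sparse vectors like (14,[indices],[values]).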

//Figure 7.69 Verify and load the LibSVM-format training data------------------------------------------------------

val dsSmartCar_10 = spark.read.format("libsvm").load("/pilot-pjt/spark-data/classification/smartCarLibSVM")
dsSmartCar_10.show(5)

//Figure 7.70 Create the training and test datasets------------------------------------------------------

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(dsSmartCar_10)

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .fit(dsSmartCar_10)

val Array(trainingData, testData) = dsSmartCar_10.randomSplit(Array(0.7, 0.3))
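// (Optional, not in the book) randomSplit is non-deterministic, so the metric values
// further below will vary slightly between runs; passing a seed makes the 70/30 split
// reproducible, for example:
// val Array(trainingData, testData) = dsSmartCar_10.randomSplit(Array(0.7, 0.3), seed = 11L)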

//Figure 7.71 Train a random forest model to predict the smart car's status------------------------------------------------------

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(3)
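// (Optional, not in the book) RandomForestClassifier exposes more tuning knobs; more
// and deeper trees than the book's 3 may improve accuracy at the cost of training time.
// A sketch of a tuned variant:
// val rfTuned = new RandomForestClassifier()
//   .setLabelCol("indexedLabel")
//   .setFeaturesCol("indexedFeatures")
//   .setNumTrees(20)
//   .setMaxDepth(10)
//   .setSeed(11L)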
  
  
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
  
  
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))


val model = pipeline.fit(trainingData)
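// (Optional, not in the book) The fitted pipeline can be saved to HDFS and reloaded
// later without retraining; the path here is only an example.
// import org.apache.spark.ml.PipelineModel
// model.write.overwrite().save("/pilot-pjt/spark-data/classification/rf-model")
// val savedModel = PipelineModel.load("/pilot-pjt/spark-data/classification/rf-model")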

//Figure 7.72 Inspect the structure of the random forest model------------------------------------------------------


val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"RandomForest Model Description :\n ${rfModel.toDebugString}")

//Figure 7.73 Run the random forest model evaluator------------------------------------------------------



val predictions = model.transform(testData)
predictions.select("predictedLabel", "label", "features").show(5)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)


//Figure 7.74 Evaluate the smart car status prediction model – accuracy------------------------------------------------------


println(s"@ Accuracy Rate = ${(accuracy)}")
println(s"@ Error Rate = ${(1.0 - accuracy)}")


//Figure 7.75 Evaluate the smart car status prediction model – confusion matrix------------------------------------------------------

val results = model.transform(testData).select("features", "label", "prediction")
val predictionAndLabels = results.select($"prediction",$"label").as[(Double, Double)].rdd

val bMetrics = new BinaryClassificationMetrics(predictionAndLabels)
val mMetrics = new MulticlassMetrics(predictionAndLabels)
val labels = mMetrics.labels

println("Confusion Matrix:")
println(mMetrics.confusionMatrix)

//Figure 7.77 Evaluate the smart car status prediction model – Precision------------------------------------------------------

 labels.foreach { rate =>
    println(s"@ Precision Rate($rate) = " + mMetrics.precision(rate))
 }

//Figure 7.78 Evaluate the smart car status prediction model – Recall------------------------------------------------------
 labels.foreach { rate =>
    println(s"Recall Rate($rate) = " + mMetrics.recall(rate))
 }
 labels.foreach { rate =>
   println(s"False Positive Rate($rate) = " + mMetrics.falsePositiveRate(rate))
 }

//Figure 7.79 Evaluate the smart car status prediction model – F1-Score------------------------------------------------------
 labels.foreach { rate =>
   println(s"F1-Score($rate) = " + mMetrics.fMeasure(rate))
 }
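 // (Optional, not in the book) MulticlassMetrics also exposes aggregate metrics, and
 // bMetrics, which the note creates but never uses, can report the area under the ROC
 // curve (computed here from hard 0/1 predictions rather than probabilities, so it is
 // only a rough figure). mMetrics.accuracy requires Spark 2.0 or later.
 println(s"Weighted Precision = " + mMetrics.weightedPrecision)
 println(s"Weighted Recall = " + mMetrics.weightedRecall)
 println(s"Weighted F1-Score = " + mMetrics.weightedFMeasure)
 println(s"Overall Accuracy = " + mMetrics.accuracy)
 println(s"Area Under ROC = " + bMetrics.areaUnderROC())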

 

The run produced the results below.

+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
|_c0|_c1| _c2| _c3| _c4| _c5| _c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|
+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
| 남| 24|기혼|인천|학생|2500|2008|  E| 70| 86|  85|  93|   1|   1|   1|   1|   B|   A|  95|정상|
| 남| 24|기혼|인천|학생|2500|2008|  E| 74| 82|  70|  95|   1|   1|   1|   1|   A|   B|  84|정상|
| 남| 24|기혼|인천|학생|2500|2008|  E| 98| 80|  78|  88|   1|   1|   1|   1|   A|   A|  86|정상|
| 남| 24|기혼|인천|학생|2500|2008|  E| 70| 74|  78|  90|   1|   1|   1|   1|   A|   A| 100|정상|
| 남| 24|기혼|인천|학생|2500|2008|  E| 73| 90|  81|  82|   1|   1|   1|   1|   A|   A|  52|정상|
+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
only showing top 5 rows

+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
|car_capacity|car_year|tire_fl|tire_fr|tire_bl|tire_br|light_fl|light_fr|light_bl|light_br|battery|car_model_n|engine_n|break_n|label|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
|        2500|    2008|     70|     86|     85|     93|       1|       1|       1|       1|     95|        0.0|     1.0|    0.0|  0.0|
|        2500|    2008|     74|     82|     70|     95|       1|       1|       1|       1|     84|        0.0|     0.0|    1.0|  0.0|
|        2500|    2008|     98|     80|     78|     88|       1|       1|       1|       1|     86|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     70|     74|     78|     90|       1|       1|       1|       1|    100|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     73|     90|     81|     82|       1|       1|       1|       1|     52|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     97|     70|     96|     95|       1|       1|       1|       1|     92|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     93|     86|     86|     89|       1|       1|       1|       1|     98|        0.0|     0.0|    1.0|  0.0|
|        2500|    2008|     80|     79|     87|     91|       1|       1|       1|       1|     93|        0.0|     1.0|    0.0|  0.0|
|        2500|    2008|     93|     73|     73|     93|       1|       1|       1|       1|    100|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     91|     90|     92|     76|       1|       1|       1|       1|     87|        0.0|     1.0|    1.0|  1.0|
|        2500|    2008|     85|     74|     71|     76|       1|       1|       1|       1|     59|        0.0|     0.0|    1.0|  1.0|
|        2500|    2008|     86|     90|     94|     70|       1|       1|       1|       1|     90|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     88|     84|     94|     88|       1|       1|       1|       1|     93|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     95|     94|     78|     92|       1|       1|       1|       1|     98|        0.0|     1.0|    1.0|  1.0|
|        2500|    2008|    100|     97|     78|     97|       1|       1|       1|       1|     93|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     82|     98|     81|     93|       1|       1|       1|       1|     97|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|    100|     99|     76|     85|       1|       1|       1|       1|     91|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     93|    100|     89|     71|       1|       1|       1|       1|     81|        0.0|     0.0|    0.0|  0.0|
|        2500|    2008|     71|     89|    100|     73|       1|       1|       1|       1|     84|        0.0|     1.0|    0.0|  0.0|
|        2500|    2008|     98|     89|     93|     99|       1|       1|       1|       1|     95|        0.0|     0.0|    0.0|  0.0|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
only showing top 20 rows

+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
|car_capacity|car_year|tire_fl|tire_fr|tire_bl|tire_br|light_fl|light_fr|light_bl|light_br|battery|car_model_n|engine_n|break_n|label|            features|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
|        2500|    2008|     70|     86|     85|     93|       1|       1|       1|       1|     95|        0.0|     1.0|    0.0|  0.0|[0.6,0.5,0.0,0.66...|
|        2500|    2008|     74|     82|     70|     95|       1|       1|       1|       1|     84|        0.0|     0.0|    1.0|  0.0|[0.6,0.5,0.0,0.71...|
|        2500|    2008|     98|     80|     78|     88|       1|       1|       1|       1|     86|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.97...|
|        2500|    2008|     70|     74|     78|     90|       1|       1|       1|       1|    100|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.66...|
|        2500|    2008|     73|     90|     81|     82|       1|       1|       1|       1|     52|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.7,...|
|        2500|    2008|     97|     70|     96|     95|       1|       1|       1|       1|     92|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.96...|
|        2500|    2008|     93|     86|     86|     89|       1|       1|       1|       1|     98|        0.0|     0.0|    1.0|  0.0|[0.6,0.5,0.0,0.92...|
|        2500|    2008|     80|     79|     87|     91|       1|       1|       1|       1|     93|        0.0|     1.0|    0.0|  0.0|[0.6,0.5,0.0,0.77...|
|        2500|    2008|     93|     73|     73|     93|       1|       1|       1|       1|    100|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.92...|
|        2500|    2008|     91|     90|     92|     76|       1|       1|       1|       1|     87|        0.0|     1.0|    1.0|  1.0|[0.6,0.5,0.0,0.9,...|
|        2500|    2008|     85|     74|     71|     76|       1|       1|       1|       1|     59|        0.0|     0.0|    1.0|  1.0|[0.6,0.5,0.0,0.83...|
|        2500|    2008|     86|     90|     94|     70|       1|       1|       1|       1|     90|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.84...|
|        2500|    2008|     88|     84|     94|     88|       1|       1|       1|       1|     93|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.86...|
|        2500|    2008|     95|     94|     78|     92|       1|       1|       1|       1|     98|        0.0|     1.0|    1.0|  1.0|[0.6,0.5,0.0,0.94...|
|        2500|    2008|    100|     97|     78|     97|       1|       1|       1|       1|     93|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,1.0,...|
|        2500|    2008|     82|     98|     81|     93|       1|       1|       1|       1|     97|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.8,...|
|        2500|    2008|    100|     99|     76|     85|       1|       1|       1|       1|     91|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,1.0,...|
|        2500|    2008|     93|    100|     89|     71|       1|       1|       1|       1|     81|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.92...|
|        2500|    2008|     71|     89|    100|     73|       1|       1|       1|       1|     84|        0.0|     1.0|    0.0|  0.0|[0.6,0.5,0.0,0.67...|
|        2500|    2008|     98|     89|     93|     99|       1|       1|       1|       1|     95|        0.0|     0.0|    0.0|  0.0|[0.6,0.5,0.0,0.97...|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
only showing top 20 rows

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(14,[0,1,3,4,5,6,...|
|  0.0|(14,[0,1,3,4,5,6,...|
|  0.0|(14,[0,1,3,4,5,6,...|
|  0.0|(14,[0,1,3,4,5,6,...|
|  0.0|(14,[0,1,3,4,5,6,...|
+-----+--------------------+
only showing top 5 rows

RandomForest Model Description :
 RandomForestClassificationModel (uid=rfc_2681b3e83f64) with 3 trees
  Tree 0 (weight 1.0):
    If (feature 12 in {1.0,2.0})
     If (feature 12 in {2.0})
      Predict: 1.0
     Else (feature 12 not in {2.0})
      If (feature 2 in {0.0,1.0,5.0,6.0})
       If (feature 11 in {0.0,2.0})
        If (feature 11 in {2.0})
         Predict: 1.0
        Else (feature 11 not in {2.0})
         Predict: 0.0
       Else (feature 11 not in {0.0,2.0})
        Predict: 1.0
      Else (feature 2 not in {0.0,1.0,5.0,6.0})
       Predict: 1.0
    Else (feature 12 not in {1.0,2.0})
     If (feature 7 in {1.0})
      If (feature 10 in {1.0})
       Predict: 1.0
      Else (feature 10 not in {1.0})
       If (feature 2 in {6.0})
        Predict: 0.0
       Else (feature 2 not in {6.0})
        Predict: 1.0
     Else (feature 7 not in {1.0})
      If (feature 9 in {1.0})
       If (feature 3 <= 0.6722222222222223)
        If (feature 2 in {0.0,1.0,2.0,6.0})
         Predict: 0.0
        Else (feature 2 not in {0.0,1.0,2.0,6.0})
         Predict: 1.0
       Else (feature 3 > 0.6722222222222223)
        If (feature 2 in {0.0,1.0,2.0,6.0})
         Predict: 0.0
        Else (feature 2 not in {0.0,1.0,2.0,6.0})
         Predict: 1.0
      Else (feature 9 not in {1.0})
       If (feature 11 in {0.0,2.0})
        If (feature 8 in {1.0})
         Predict: 1.0
        Else (feature 8 not in {1.0})
         Predict: 0.0
       Else (feature 11 not in {0.0,2.0})
        If (feature 13 <= 0.6377551020408163)
         Predict: 1.0
        Else (feature 13 > 0.6377551020408163)
         Predict: 0.0
  Tree 1 (weight 1.0):
    If (feature 12 in {1.0,2.0})
     If (feature 2 in {0.0,1.0,6.0})
      If (feature 0 in {2.0,4.0,5.0,6.0,7.0})
       If (feature 4 <= 0.6989795918367347)
        If (feature 0 in {2.0,4.0,5.0,7.0})
         Predict: 0.0
        Else (feature 0 not in {2.0,4.0,5.0,7.0})
         Predict: 1.0
       Else (feature 4 > 0.6989795918367347)
        If (feature 9 in {1.0})
         Predict: 1.0
        Else (feature 9 not in {1.0})
         Predict: 0.0
      Else (feature 0 not in {2.0,4.0,5.0,6.0,7.0})
       If (feature 6 <= 0.6818181818181819)
        Predict: 1.0
       Else (feature 6 > 0.6818181818181819)
        If (feature 11 in {1.0,2.0})
         Predict: 1.0
        Else (feature 11 not in {1.0,2.0})
         Predict: 0.0
     Else (feature 2 not in {0.0,1.0,6.0})
      If (feature 2 in {5.0,7.0})
       If (feature 12 in {2.0})
        Predict: 1.0
       Else (feature 12 not in {2.0})
        If (feature 6 <= 0.6818181818181819)
         Predict: 0.0
        Else (feature 6 > 0.6818181818181819)
         Predict: 1.0
      Else (feature 2 not in {5.0,7.0})
       Predict: 1.0
    Else (feature 12 not in {1.0,2.0})
     If (feature 8 in {1.0})
      If (feature 1 in {16.0})
       If (feature 13 <= 0.6377551020408163)
        If (feature 5 <= 0.995)
         Predict: 1.0
        Else (feature 5 > 0.995)
         Predict: 0.0
       Else (feature 13 > 0.6377551020408163)
        Predict: 0.0
      Else (feature 1 not in {16.0})
       If (feature 3 <= 0.7055555555555555)
        Predict: 1.0
       Else (feature 3 > 0.7055555555555555)
        If (feature 0 in {3.0,4.0,6.0})
         Predict: 0.0
        Else (feature 0 not in {3.0,4.0,6.0})
         Predict: 1.0
     Else (feature 8 not in {1.0})
      If (feature 10 in {1.0})
       If (feature 7 in {1.0})
        Predict: 1.0
       Else (feature 7 not in {1.0})
        If (feature 1 in {9.0,15.0,16.0})
         Predict: 0.0
        Else (feature 1 not in {9.0,15.0,16.0})
         Predict: 1.0
      Else (feature 10 not in {1.0})
       If (feature 9 in {1.0})
        Predict: 1.0
       Else (feature 9 not in {1.0})
        Predict: 0.0
  Tree 2 (weight 1.0):
    If (feature 7 in {1.0})
     If (feature 9 in {1.0})
      Predict: 1.0
     Else (feature 9 not in {1.0})
      If (feature 2 in {0.0,6.0})
       If (feature 1 in {3.0,12.0})
        If (feature 0 in {4.0})
         Predict: 0.0
        Else (feature 0 not in {4.0})
         Predict: 1.0
       Else (feature 1 not in {3.0,12.0})
        If (feature 12 in {0.0})
         Predict: 0.0
        Else (feature 12 not in {0.0})
         Predict: 1.0
      Else (feature 2 not in {0.0,6.0})
       Predict: 1.0
    Else (feature 7 not in {1.0})
     If (feature 8 in {1.0})
      If (feature 12 in {1.0})
       Predict: 1.0
      Else (feature 12 not in {1.0})
       If (feature 2 in {1.0,6.0})
        If (feature 11 in {0.0})
         Predict: 0.0
        Else (feature 11 not in {0.0})
         Predict: 1.0
       Else (feature 2 not in {1.0,6.0})
        Predict: 1.0
     Else (feature 8 not in {1.0})
      If (feature 11 in {1.0,2.0})
       If (feature 12 in {1.0,2.0})
        Predict: 1.0
       Else (feature 12 not in {1.0,2.0})
        If (feature 2 in {0.0,1.0,5.0,6.0})
         Predict: 0.0
        Else (feature 2 not in {0.0,1.0,5.0,6.0})
         Predict: 1.0
      Else (feature 11 not in {1.0,2.0})
       Predict: 0.0

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(14,[0,1,2,3,4,5,...|
|           0.0|  0.0|(14,[0,1,2,3,4,5,...|
|           0.0|  0.0|(14,[0,1,2,3,4,5,...|
|           0.0|  0.0|(14,[0,1,2,3,4,5,...|
|           0.0|  0.0|(14,[0,1,2,3,4,5,...|
+--------------+-----+--------------------+
only showing top 5 rows

@ Accuracy Rate = 0.9149558797601713
@ Error Rate = 0.08504412023982866
Confusion Matrix:
1117715.0  37138.0   
100817.0   366488.0  
@ Precision Rate(0.0) = 0.9172635597587917
@ Precision Rate(1.0) = 0.9079890789988752
Recall Rate(0.0) = 0.9678417945833798
Recall Rate(1.0) = 0.7842586747413359
False Positive Rate(0.0) = 0.21574132525866405
False Positive Rate(1.0) = 0.03215820541662012
F1-Score(0.0) = 0.9418741586384004
F1-Score(1.0) = 0.8416005401116734
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.util.MLUtils
ds: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 18 more fields]
dsSmartCar: org.apache.spark.sql.DataFrame = [car_capacity: bigint, car_year: bigint ... 13 more fields]
dsSmartCar_1: org.apache.spark.sql.DataFrame = [car_capacity: bi...

 

Result metrics

 

Precision - of the smart cars the model classified as abnormal (비정상), the proportion that are actually abnormal

Recall - of the smart cars that are actually abnormal, the proportion the model classified as abnormal

F1-Score - the harmonic mean of precision and recall
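
As a sanity check, these metrics can be recomputed by hand from the confusion matrix printed above (rows are actual labels, columns are predicted labels, ordered 0.0 = 정상, 1.0 = 비정상). A minimal Scala sketch using the numbers from this run, treating 비정상 (1.0) as the positive class:

// Confusion matrix cells from the output above (actual x predicted).
val tn = 1117715.0; val fp = 37138.0   // actual 0.0 (정상):   predicted 0.0 / 1.0
val fn = 100817.0;  val tp = 366488.0  // actual 1.0 (비정상): predicted 0.0 / 1.0

val precision1 = tp / (tp + fp)                                     // ≈ 0.9080 -> Precision Rate(1.0)
val recall1    = tp / (tp + fn)                                     // ≈ 0.7843 -> Recall Rate(1.0)
val f1Score1   = 2 * precision1 * recall1 / (precision1 + recall1)  // ≈ 0.8416 -> F1-Score(1.0)
val accuracy   = (tp + tn) / (tp + tn + fp + fn)                    // ≈ 0.9150 -> Accuracy Rate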

 

While working through the code you may run into the error below; if so, try the following steps.

java.lang.NullPointerException at org.apache.thrift.transport.TSocket.open(TSocket.java:170)

 

◼  If an error occurs when running Zeppelin, clear the Cloudera Manager server logs

rm -rf /var/log/cloudera-scm-server/*


◼  Take HDFS out of safe mode

hdfs dfsadmin -safemode leave


◼  Restart Cluster 1, then restart Zeppelin

zeppelin-daemon.sh restart