🎁 This post documents my walkthrough of the book '실무로 배우는 빅데이터 기술' (Practical Big Data Technology), following along and running its examples.
🎁 It is a good way to learn the overall flow of big data processing, and I recommend it to anyone interested in big data.
◼ Run the query below in the Hive query editor to transform the dataset.
insert overwrite local directory '/home/pilot-pjt/spark-data/classification/input'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select
sex, age, marriage, region, job, car_capacity, car_year, car_model,
tire_fl, tire_fr, tire_bl, tire_br, light_fl, light_fr, light_bl, light_br,
engine, break, battery,
case when ((tire_fl_s + tire_fr_s + tire_bl_s + tire_br_s +
light_fl_s + light_fr_s + light_bl_s + light_br_s +
engine_s + break_s + battery_s +
car_capacity_s + car_year_s + car_model_s) < 6)
then '비정상' else '정상'
end as status
from (
select
sex, age, marriage, region, job, car_capacity, car_year, car_model,
tire_fl, tire_fr, tire_bl, tire_br, light_fl, light_fr, light_bl, light_br,
engine, break, battery,
case
when (1500 > cast(car_capacity as int)) then -0.3
when (2000 > cast(car_capacity as int)) then -0.2
else -0.1
end as car_capacity_s ,
case
when (2005 > cast(car_year as int)) then -0.3
when (2010 > cast(car_year as int)) then -0.2
else -0.1
end as car_year_s ,
case
when ('B' = car_model) then -0.3
when ('D' = car_model) then -0.3
when ('F' = car_model) then -0.3
when ('H' = car_model) then -0.3
else 0.0
end as car_model_s ,
case
when (10 > cast(tire_fl as int)) then 0.1
when (20 > cast(tire_fl as int)) then 0.2
when (40 > cast(tire_fl as int)) then 0.4
else 0.5
end as tire_fl_s ,
case
when (10 > cast(tire_fr as int)) then 0.1
when (20 > cast(tire_fr as int)) then 0.2
when (40 > cast(tire_fr as int)) then 0.4
else 0.5
end as tire_fr_s ,
case
when (10 > cast(tire_bl as int)) then 0.1
when (20 > cast(tire_bl as int)) then 0.2
when (40 > cast(tire_bl as int)) then 0.4
else 0.5
end as tire_bl_s ,
case
when (10 > cast(tire_br as int)) then 0.1
when (20 > cast(tire_br as int)) then 0.2
when (40 > cast(tire_br as int)) then 0.4
else 0.5
end as tire_br_s ,
case when (cast(light_fl as int) = 2) then 0.0 else 0.5 end as light_fl_s ,
case when (cast(light_fr as int) = 2) then 0.0 else 0.5 end as light_fr_s ,
case when (cast(light_bl as int) = 2) then 0.0 else 0.5 end as light_bl_s ,
case when (cast(light_br as int) = 2) then 0.0 else 0.5 end as light_br_s ,
case
when (engine = 'A') then 1.0
when (engine = 'B') then 0.5
when (engine = 'C') then 0.0
end as engine_s ,
case
when (break = 'A') then 1.0
when (break = 'B') then 0.5
when (break = 'C') then 0.0
end as break_s ,
case
when (20 > cast(battery as int)) then 0.2
when (40 > cast(battery as int)) then 0.4
when (60 > cast(battery as int)) then 0.6
else 1.0
end as battery_s
from managed_smartcar_status_info ) T1
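As an optional sanity check (not part of the book's steps), you can verify that the source table actually has data by running a quick count in the Hive query editor before or after the export:
select count(*) from managed_smartcar_status_info;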
◼ Let's check the generated files at the path below (on server02).
more /home/pilot-pjt/spark-data/classification/input/*
◼ You can also list the files in that directory.
cd /home/pilot-pjt/spark-data/classification/input
ls -al
◼ The output is currently split across two files (000000_0, 000001_0); merge them into one.
cat 000000_0 000001_0 > classification_dataset.txt
The merged data file is now available.
◼ Create a target directory in HDFS.
hdfs dfs -mkdir -p /pilot-pjt/spark-data/classification/input
◼ Put the merged file into that HDFS directory.
hdfs dfs -put /home/pilot-pjt/spark-data/classification/input/classification_dataset.txt /pilot-pjt/spark-data/classification/input
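To double-check that the upload worked, you can list the HDFS directory (a quick optional check):
hdfs dfs -ls /pilot-pjt/spark-data/classification/input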
◼ Restart Zeppelin.
zeppelin-daemon.sh restart
◼ Create a new note and run the following code.
//Figure 7.62 Import the Spark ML libraries------------------------------------------------------
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.util.MLUtils
//Figure 7.63 Load the training data into Spark ML------------------------------------------------------
val ds = spark.read.csv("/pilot-pjt/spark-data/classification/input/classification_dataset.txt")
ds.show(5)
//Figure 7.64 Select the columns to use in Spark ML------------------------------------------------------
val dsSmartCar = ds.selectExpr("cast(_c5 as long) car_capacity",
"cast(_c6 as long) car_year",
"cast(_c7 as string) car_model",
"cast(_c8 as int) tire_fl",
"cast(_c9 as long) tire_fr",
"cast(_c10 as long) tire_bl",
"cast(_c11 as long) tire_br",
"cast(_c12 as long) light_fl",
"cast(_c13 as long) light_fr",
"cast(_c14 as long) light_bl",
"cast(_c15 as long) light_br",
"cast(_c16 as string) engine",
"cast(_c17 as string) break",
"cast(_c18 as long) battery",
"cast(_c19 as string) status"
)
//Figure 7.65 Convert categorical columns into numeric (indexed) columns------------------------------------------------------
val dsSmartCar_1 = new StringIndexer().setInputCol("car_model").setOutputCol("car_model_n").fit(dsSmartCar).transform(dsSmartCar)
val dsSmartCar_2 = new StringIndexer().setInputCol("engine").setOutputCol("engine_n").fit(dsSmartCar_1).transform(dsSmartCar_1)
val dsSmartCar_3 = new StringIndexer().setInputCol("break").setOutputCol("break_n").fit(dsSmartCar_2).transform(dsSmartCar_2)
val dsSmartCar_4 = new StringIndexer().setInputCol("status").setOutputCol("label").fit(dsSmartCar_3).transform(dsSmartCar_3)
val dsSmartCar_5 = dsSmartCar_4.drop("car_model").drop("engine").drop("break").drop("status")
dsSmartCar_5.show()
//Figure 7.66 Assemble and scale the feature vector for Spark ML------------------------------------------------------
val cols = Array("car_capacity", "car_year", "car_model_n", "tire_fl",
"tire_fr", "tire_bl", "tire_br", "light_fl", "light_fr",
"light_bl", "light_br", "engine_n", "break_n", "battery")
val dsSmartCar_6 = new VectorAssembler().setInputCols(cols).setOutputCol("features").transform(dsSmartCar_5)
val dsSmartCar_7 = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures").fit(dsSmartCar_6).transform(dsSmartCar_6)
val dsSmartCar_8 = dsSmartCar_7.drop("features").withColumnRenamed("scaledFeatures", "features")
dsSmartCar_8.show()
//Figure 7.67 Save the training data in LibSVM format------------------------------------------------------
val dsSmartCar_9 = dsSmartCar_8.select("label", "features")
dsSmartCar_9.write.format("libsvm").save("/pilot-pjt/spark-data/classification/smartCarLibSVM")
//Figure 7.69 Check and load the LibSVM-format training data------------------------------------------------------
val dsSmartCar_10 = spark.read.format("libsvm").load("/pilot-pjt/spark-data/classification/smartCarLibSVM")
dsSmartCar_10.show(5)
//Figure 7.70 Create the training and test datasets------------------------------------------------------
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(dsSmartCar_10)
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.fit(dsSmartCar_10)
val Array(trainingData, testData) = dsSmartCar_10.randomSplit(Array(0.7, 0.3))
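// Optional sanity check (not in the book): the row counts should come out roughly 7:3
println(s"training rows = ${trainingData.count()}, test rows = ${testData.count()}")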
//Figure 7.71 Train a random forest model to predict the smart car's status------------------------------------------------------
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(3)
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
val model = pipeline.fit(trainingData)
//Figure 7.72 Inspect the structure of the random forest model------------------------------------------------------
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"RandomForest Model Description :\n ${rfModel.toDebugString}")
//Figure 7.73 Run the evaluator on the random forest model------------------------------------------------------
val predictions = model.transform(testData)
predictions.select("predictedLabel", "label", "features").show(5)
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
//Figure 7.74 Evaluate the smart-car status prediction model – accuracy------------------------------------------------------
println(s"@ Accuracy Rate = ${(accuracy)}")
println(s"@ Error Rate = ${(1.0 - accuracy)}")
//Figure 7.75 Evaluate the smart-car status prediction model – confusion matrix------------------------------------------------------
val results = model.transform(testData).select("features", "label", "prediction")
val predictionAndLabels = results.select($"prediction",$"label").as[(Double, Double)].rdd
val bMetrics = new BinaryClassificationMetrics(predictionAndLabels)
val mMetrics = new MulticlassMetrics(predictionAndLabels)
val labels = mMetrics.labels
println("Confusion Matrix:")
println(mMetrics.confusionMatrix)
//Figure 7.77 Evaluate the smart-car status prediction model – Precision------------------------------------------------------
labels.foreach { rate =>
println(s"@ Precision Rate($rate) = " + mMetrics.precision(rate))
}
//Figure 7.78 Evaluate the smart-car status prediction model – Recall------------------------------------------------------
labels.foreach { rate =>
println(s"Recall Rate($rate) = " + mMetrics.recall(rate))
}
labels.foreach { rate =>
println(s"False Positive Rate($rate) = " + mMetrics.falsePositiveRate(rate))
}
//Figure 7.79 Evaluate the smart-car status prediction model – F1-Score------------------------------------------------------
labels.foreach { rate =>
println(s"F1-Score($rate) = " + mMetrics.fMeasure(rate))
}
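As a side note (not covered in the book), the fitted pipeline can also be persisted to HDFS and reloaded later; the path below is only an example:
// Save the trained pipeline model for later reuse (example path)
model.write.overwrite().save("/pilot-pjt/spark-data/classification/model")
// Reload it in another note or session
val savedModel = org.apache.spark.ml.PipelineModel.load("/pilot-pjt/spark-data/classification/model")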
Running the note produces output like the following.
+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
|_c0|_c1| _c2| _c3| _c4| _c5| _c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|_c19|
+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
| 남| 24|기혼|인천|학생|2500|2008| E| 70| 86| 85| 93| 1| 1| 1| 1| B| A| 95|정상|
| 남| 24|기혼|인천|학생|2500|2008| E| 74| 82| 70| 95| 1| 1| 1| 1| A| B| 84|정상|
| 남| 24|기혼|인천|학생|2500|2008| E| 98| 80| 78| 88| 1| 1| 1| 1| A| A| 86|정상|
| 남| 24|기혼|인천|학생|2500|2008| E| 70| 74| 78| 90| 1| 1| 1| 1| A| A| 100|정상|
| 남| 24|기혼|인천|학생|2500|2008| E| 73| 90| 81| 82| 1| 1| 1| 1| A| A| 52|정상|
+---+---+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+----+
only showing top 5 rows
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
|car_capacity|car_year|tire_fl|tire_fr|tire_bl|tire_br|light_fl|light_fr|light_bl|light_br|battery|car_model_n|engine_n|break_n|label|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
| 2500| 2008| 70| 86| 85| 93| 1| 1| 1| 1| 95| 0.0| 1.0| 0.0| 0.0|
| 2500| 2008| 74| 82| 70| 95| 1| 1| 1| 1| 84| 0.0| 0.0| 1.0| 0.0|
| 2500| 2008| 98| 80| 78| 88| 1| 1| 1| 1| 86| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 70| 74| 78| 90| 1| 1| 1| 1| 100| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 73| 90| 81| 82| 1| 1| 1| 1| 52| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 97| 70| 96| 95| 1| 1| 1| 1| 92| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 93| 86| 86| 89| 1| 1| 1| 1| 98| 0.0| 0.0| 1.0| 0.0|
| 2500| 2008| 80| 79| 87| 91| 1| 1| 1| 1| 93| 0.0| 1.0| 0.0| 0.0|
| 2500| 2008| 93| 73| 73| 93| 1| 1| 1| 1| 100| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 91| 90| 92| 76| 1| 1| 1| 1| 87| 0.0| 1.0| 1.0| 1.0|
| 2500| 2008| 85| 74| 71| 76| 1| 1| 1| 1| 59| 0.0| 0.0| 1.0| 1.0|
| 2500| 2008| 86| 90| 94| 70| 1| 1| 1| 1| 90| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 88| 84| 94| 88| 1| 1| 1| 1| 93| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 95| 94| 78| 92| 1| 1| 1| 1| 98| 0.0| 1.0| 1.0| 1.0|
| 2500| 2008| 100| 97| 78| 97| 1| 1| 1| 1| 93| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 82| 98| 81| 93| 1| 1| 1| 1| 97| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 100| 99| 76| 85| 1| 1| 1| 1| 91| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 93| 100| 89| 71| 1| 1| 1| 1| 81| 0.0| 0.0| 0.0| 0.0|
| 2500| 2008| 71| 89| 100| 73| 1| 1| 1| 1| 84| 0.0| 1.0| 0.0| 0.0|
| 2500| 2008| 98| 89| 93| 99| 1| 1| 1| 1| 95| 0.0| 0.0| 0.0| 0.0|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+
only showing top 20 rows
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
|car_capacity|car_year|tire_fl|tire_fr|tire_bl|tire_br|light_fl|light_fr|light_bl|light_br|battery|car_model_n|engine_n|break_n|label| features|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
| 2500| 2008| 70| 86| 85| 93| 1| 1| 1| 1| 95| 0.0| 1.0| 0.0| 0.0|[0.6,0.5,0.0,0.66...|
| 2500| 2008| 74| 82| 70| 95| 1| 1| 1| 1| 84| 0.0| 0.0| 1.0| 0.0|[0.6,0.5,0.0,0.71...|
| 2500| 2008| 98| 80| 78| 88| 1| 1| 1| 1| 86| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.97...|
| 2500| 2008| 70| 74| 78| 90| 1| 1| 1| 1| 100| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.66...|
| 2500| 2008| 73| 90| 81| 82| 1| 1| 1| 1| 52| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.7,...|
| 2500| 2008| 97| 70| 96| 95| 1| 1| 1| 1| 92| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.96...|
| 2500| 2008| 93| 86| 86| 89| 1| 1| 1| 1| 98| 0.0| 0.0| 1.0| 0.0|[0.6,0.5,0.0,0.92...|
| 2500| 2008| 80| 79| 87| 91| 1| 1| 1| 1| 93| 0.0| 1.0| 0.0| 0.0|[0.6,0.5,0.0,0.77...|
| 2500| 2008| 93| 73| 73| 93| 1| 1| 1| 1| 100| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.92...|
| 2500| 2008| 91| 90| 92| 76| 1| 1| 1| 1| 87| 0.0| 1.0| 1.0| 1.0|[0.6,0.5,0.0,0.9,...|
| 2500| 2008| 85| 74| 71| 76| 1| 1| 1| 1| 59| 0.0| 0.0| 1.0| 1.0|[0.6,0.5,0.0,0.83...|
| 2500| 2008| 86| 90| 94| 70| 1| 1| 1| 1| 90| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.84...|
| 2500| 2008| 88| 84| 94| 88| 1| 1| 1| 1| 93| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.86...|
| 2500| 2008| 95| 94| 78| 92| 1| 1| 1| 1| 98| 0.0| 1.0| 1.0| 1.0|[0.6,0.5,0.0,0.94...|
| 2500| 2008| 100| 97| 78| 97| 1| 1| 1| 1| 93| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,1.0,...|
| 2500| 2008| 82| 98| 81| 93| 1| 1| 1| 1| 97| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.8,...|
| 2500| 2008| 100| 99| 76| 85| 1| 1| 1| 1| 91| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,1.0,...|
| 2500| 2008| 93| 100| 89| 71| 1| 1| 1| 1| 81| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.92...|
| 2500| 2008| 71| 89| 100| 73| 1| 1| 1| 1| 84| 0.0| 1.0| 0.0| 0.0|[0.6,0.5,0.0,0.67...|
| 2500| 2008| 98| 89| 93| 99| 1| 1| 1| 1| 95| 0.0| 0.0| 0.0| 0.0|[0.6,0.5,0.0,0.97...|
+------------+--------+-------+-------+-------+-------+--------+--------+--------+--------+-------+-----------+--------+-------+-----+--------------------+
only showing top 20 rows
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(14,[0,1,3,4,5,6,...|
| 0.0|(14,[0,1,3,4,5,6,...|
| 0.0|(14,[0,1,3,4,5,6,...|
| 0.0|(14,[0,1,3,4,5,6,...|
| 0.0|(14,[0,1,3,4,5,6,...|
+-----+--------------------+
only showing top 5 rows
RandomForest Model Description :
RandomForestClassificationModel (uid=rfc_2681b3e83f64) with 3 trees
Tree 0 (weight 1.0):
If (feature 12 in {1.0,2.0})
If (feature 12 in {2.0})
Predict: 1.0
Else (feature 12 not in {2.0})
If (feature 2 in {0.0,1.0,5.0,6.0})
If (feature 11 in {0.0,2.0})
If (feature 11 in {2.0})
Predict: 1.0
Else (feature 11 not in {2.0})
Predict: 0.0
Else (feature 11 not in {0.0,2.0})
Predict: 1.0
Else (feature 2 not in {0.0,1.0,5.0,6.0})
Predict: 1.0
Else (feature 12 not in {1.0,2.0})
If (feature 7 in {1.0})
If (feature 10 in {1.0})
Predict: 1.0
Else (feature 10 not in {1.0})
If (feature 2 in {6.0})
Predict: 0.0
Else (feature 2 not in {6.0})
Predict: 1.0
Else (feature 7 not in {1.0})
If (feature 9 in {1.0})
If (feature 3 <= 0.6722222222222223)
If (feature 2 in {0.0,1.0,2.0,6.0})
Predict: 0.0
Else (feature 2 not in {0.0,1.0,2.0,6.0})
Predict: 1.0
Else (feature 3 > 0.6722222222222223)
If (feature 2 in {0.0,1.0,2.0,6.0})
Predict: 0.0
Else (feature 2 not in {0.0,1.0,2.0,6.0})
Predict: 1.0
Else (feature 9 not in {1.0})
If (feature 11 in {0.0,2.0})
If (feature 8 in {1.0})
Predict: 1.0
Else (feature 8 not in {1.0})
Predict: 0.0
Else (feature 11 not in {0.0,2.0})
If (feature 13 <= 0.6377551020408163)
Predict: 1.0
Else (feature 13 > 0.6377551020408163)
Predict: 0.0
Tree 1 (weight 1.0):
If (feature 12 in {1.0,2.0})
If (feature 2 in {0.0,1.0,6.0})
If (feature 0 in {2.0,4.0,5.0,6.0,7.0})
If (feature 4 <= 0.6989795918367347)
If (feature 0 in {2.0,4.0,5.0,7.0})
Predict: 0.0
Else (feature 0 not in {2.0,4.0,5.0,7.0})
Predict: 1.0
Else (feature 4 > 0.6989795918367347)
If (feature 9 in {1.0})
Predict: 1.0
Else (feature 9 not in {1.0})
Predict: 0.0
Else (feature 0 not in {2.0,4.0,5.0,6.0,7.0})
If (feature 6 <= 0.6818181818181819)
Predict: 1.0
Else (feature 6 > 0.6818181818181819)
If (feature 11 in {1.0,2.0})
Predict: 1.0
Else (feature 11 not in {1.0,2.0})
Predict: 0.0
Else (feature 2 not in {0.0,1.0,6.0})
If (feature 2 in {5.0,7.0})
If (feature 12 in {2.0})
Predict: 1.0
Else (feature 12 not in {2.0})
If (feature 6 <= 0.6818181818181819)
Predict: 0.0
Else (feature 6 > 0.6818181818181819)
Predict: 1.0
Else (feature 2 not in {5.0,7.0})
Predict: 1.0
Else (feature 12 not in {1.0,2.0})
If (feature 8 in {1.0})
If (feature 1 in {16.0})
If (feature 13 <= 0.6377551020408163)
If (feature 5 <= 0.995)
Predict: 1.0
Else (feature 5 > 0.995)
Predict: 0.0
Else (feature 13 > 0.6377551020408163)
Predict: 0.0
Else (feature 1 not in {16.0})
If (feature 3 <= 0.7055555555555555)
Predict: 1.0
Else (feature 3 > 0.7055555555555555)
If (feature 0 in {3.0,4.0,6.0})
Predict: 0.0
Else (feature 0 not in {3.0,4.0,6.0})
Predict: 1.0
Else (feature 8 not in {1.0})
If (feature 10 in {1.0})
If (feature 7 in {1.0})
Predict: 1.0
Else (feature 7 not in {1.0})
If (feature 1 in {9.0,15.0,16.0})
Predict: 0.0
Else (feature 1 not in {9.0,15.0,16.0})
Predict: 1.0
Else (feature 10 not in {1.0})
If (feature 9 in {1.0})
Predict: 1.0
Else (feature 9 not in {1.0})
Predict: 0.0
Tree 2 (weight 1.0):
If (feature 7 in {1.0})
If (feature 9 in {1.0})
Predict: 1.0
Else (feature 9 not in {1.0})
If (feature 2 in {0.0,6.0})
If (feature 1 in {3.0,12.0})
If (feature 0 in {4.0})
Predict: 0.0
Else (feature 0 not in {4.0})
Predict: 1.0
Else (feature 1 not in {3.0,12.0})
If (feature 12 in {0.0})
Predict: 0.0
Else (feature 12 not in {0.0})
Predict: 1.0
Else (feature 2 not in {0.0,6.0})
Predict: 1.0
Else (feature 7 not in {1.0})
If (feature 8 in {1.0})
If (feature 12 in {1.0})
Predict: 1.0
Else (feature 12 not in {1.0})
If (feature 2 in {1.0,6.0})
If (feature 11 in {0.0})
Predict: 0.0
Else (feature 11 not in {0.0})
Predict: 1.0
Else (feature 2 not in {1.0,6.0})
Predict: 1.0
Else (feature 8 not in {1.0})
If (feature 11 in {1.0,2.0})
If (feature 12 in {1.0,2.0})
Predict: 1.0
Else (feature 12 not in {1.0,2.0})
If (feature 2 in {0.0,1.0,5.0,6.0})
Predict: 0.0
Else (feature 2 not in {0.0,1.0,5.0,6.0})
Predict: 1.0
Else (feature 11 not in {1.0,2.0})
Predict: 0.0
+--------------+-----+--------------------+
|predictedLabel|label| features|
+--------------+-----+--------------------+
| 0.0| 0.0|(14,[0,1,2,3,4,5,...|
| 0.0| 0.0|(14,[0,1,2,3,4,5,...|
| 0.0| 0.0|(14,[0,1,2,3,4,5,...|
| 0.0| 0.0|(14,[0,1,2,3,4,5,...|
| 0.0| 0.0|(14,[0,1,2,3,4,5,...|
+--------------+-----+--------------------+
only showing top 5 rows
@ Accuracy Rate = 0.9149558797601713
@ Error Rate = 0.08504412023982866
Confusion Matrix:
1117715.0 37138.0
100817.0 366488.0
@ Precision Rate(0.0) = 0.9172635597587917
@ Precision Rate(1.0) = 0.9079890789988752
Recall Rate(0.0) = 0.9678417945833798
Recall Rate(1.0) = 0.7842586747413359
False Positive Rate(0.0) = 0.21574132525866405
False Positive Rate(1.0) = 0.03215820541662012
F1-Score(0.0) = 0.9418741586384004
F1-Score(1.0) = 0.8416005401116734
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.util.MLUtils
ds: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 18 more fields]
dsSmartCar: org.apache.spark.sql.DataFrame = [car_capacity: bigint, car_year: bigint ... 13 more fields]
dsSmartCar_1: org.apache.spark.sql.DataFrame = [car_capacity: bi...
Interpreting the results:
Precision - of the smart cars the model classified as abnormal, the proportion that are actually abnormal
Recall - of the actually abnormal smart cars, the proportion the model classified as abnormal
F1-Score - the harmonic mean of Precision and Recall
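As a quick hand check, the metrics reported for label 0.0 can be reproduced from the confusion matrix above (MulticlassMetrics prints actual classes as rows and predicted classes as columns, both ordered by label):
val tp = 1117715.0  // actual 0.0, predicted 0.0
val fn = 37138.0    // actual 0.0, predicted 1.0
val fp = 100817.0   // actual 1.0, predicted 0.0
val precision0 = tp / (tp + fp)                                      // ≈ 0.9173
val recall0    = tp / (tp + fn)                                      // ≈ 0.9678
val f1_0       = 2 * precision0 * recall0 / (precision0 + recall0)   // ≈ 0.9419
These match the Precision Rate(0.0), Recall Rate(0.0), and F1-Score(0.0) values printed above.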
If you hit the error below while working through the code, try the following remedies.
java.lang.NullPointerException at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
◼ If Zeppelin fails to start with an error, clear the Cloudera Manager server logs
rm -rf /var/log/cloudera-scm-server/*
◼ Take HDFS out of safe mode
hdfs dfsadmin -safemode leave
◼ In Cloudera Manager, restart Cluster 1, then restart Zeppelin
zeppelin-daemon.sh restart