logger

Elasticsearch logger

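Three kinds of logs

Main log (cluster-name.log)

: General information about what happened while the cluster was running, such as a failed query or a new node joining the cluster.

Search slow log (cluster-name_index_search_slowlog.log)

: Where queries that run too slowly are logged.

Indexing slow log (cluster-name_index_indexing_slowlog.log)

: Similar to the search slow log, but it records indexing operations that take longer than a set threshold.

Because the cluster is already running, the settings are applied on the fly with curl.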


curl -XPUT "http://localhost:9200/index_name/_settings" -H "Content-Type: application/json" -d '{
  "index.search.slowlog.threshold.query.debug" : "0s",
  "index.search.slowlog.threshold.fetch.debug": "0s",
  "index.indexing.slowlog.threshold.index.debug": "0s"
}'

https://stackoverflow.com/questions/21749997/see-all-executed-elasticsearch-queries
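The request body can also be built in code before it is sent. This is a minimal sketch using only Python's standard json module. The helper name slowlog_settings is hypothetical, the setting keys are the ones from the curl command, and the HTTP call is left out so the snippet runs without a cluster.

```python
import json


def slowlog_settings(threshold="0s"):
    """Build the index settings body used in the curl command.

    slowlog_settings is a hypothetical helper. With a threshold of "0s"
    every operation exceeds the threshold, so everything is logged.
    """
    return {
        "index.search.slowlog.threshold.query.debug": threshold,
        "index.search.slowlog.threshold.fetch.debug": threshold,
        "index.indexing.slowlog.threshold.index.debug": threshold,
    }


# Serialize to the JSON text that would be passed to curl with -d.
body = json.dumps(slowlog_settings(), indent=2)
print(body)
```

Passing a larger value such as slowlog_settings("500ms") would restrict the log to operations slower than that threshold.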

License: Attribution-NonCommercial


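Basic action and transformation operators in Spark (2)

Getting a subset of an RDD's elements with sample, take, and takeSample

sample

Suppose we need a sample dataset containing a randomly chosen 30% of the customer IDs from the previous example.

The RDD class provides the sample method for this.

def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]

The first parameter, withReplacement, specifies whether the same element may be sampled more than once.

true: sampling with replacement; false: sampling without replacement.

Sampling with replacement is like catching a fish, releasing it, and then fishing again, so the same fish can be caught twice.

Sampling without replacement is the opposite: a fish that has been caught is set aside before fishing again.

The second parameter, fraction, is the expected number of times each element is sampled when sampling with replacement; without replacement, it is the probability that each element is selected.

The third parameter, seed, is the seed for the random number generator. The same seed always produces the same pseudo-random numbers, which is useful when testing a program.

Let's try sampling the values of uniqueIds from the previous example.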

scala> val uniqueIds = idsStr.distinct
uniqueIds: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at distinct at <console>:28

scala> uniqueIds.collect
res4: Array[String] = Array(80, 20, 98, 15, 16, 31, 94, 77)

scala> val ss = uniqueIds.sample(false,0.3)
ss: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[11] at sample at <console>:30
/*
sample returns an RDD. The size of the sample is not guaranteed to be exactly 30% of the elements.
*/
scala> ss.count
res10: Long = 2

scala> ss.collect
res11: Array[String] = Array(20, 98)
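The effect of fraction and seed can be reproduced without a cluster. The sketch below is a plain-Python model of sampling without replacement, not Spark's implementation. The function name sample_without_replacement is hypothetical, and the ID list is copied from the collect output above.

```python
import random


def sample_without_replacement(items, fraction, seed=None):
    """Keep each element independently with probability `fraction`.

    A plain-Python model for illustration; the name is hypothetical
    and this is not how Spark implements RDD.sample.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


unique_ids = ["80", "20", "98", "15", "16", "31", "94", "77"]

# The same seed always yields the same sample.
first = sample_without_replacement(unique_ids, 0.3, seed=42)
second = sample_without_replacement(unique_ids, 0.3, seed=42)
print(first == second)  # True

# The sample size varies from seed to seed; 0.3 is only the expected fraction.
sizes = {len(sample_without_replacement(unique_ids, 0.3, seed=s)) for s in range(100)}
print(sorted(sizes))
```

Because every element is kept or dropped independently, the sample size changes from run to run unless the seed is fixed.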

takeSample

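The sample method above picks elements by probability; to sample an exact number of elements, use takeSample.

def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]

The first argument chooses sampling with or without replacement.

The second argument is the number of elements to return.

The third argument, seed, is the seed for the random number generator. The same seed always produces the same pseudo-random numbers, which is useful when testing a program.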

scala> val taken = uniqueIds.takeSample(false,5)
taken: Array[String] = Array(31, 77, 94, 15, 16)
/*
takeSample returns an Array, not an RDD.
*/
scala> uniqueIds.take(3)
res12: Array[String] = Array(80, 20, 98)

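take returns the requested number of elements from an RDD. It processes the RDD's partitions one at a time until it has collected that many elements.

(Processing partitions one at a time means the work is not distributed at all. To fetch elements from many partitions quickly, keep the number of elements small enough to fit in the driver's memory and use collect.)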
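The difference between takeSample and take can be modelled in a few lines. This is a plain-Python sketch under stated assumptions: take_sample and take are hypothetical stand-ins for the Spark methods, and the split of the IDs into four partitions is made up for illustration.

```python
import random
from itertools import islice


def take_sample(items, num, seed=None):
    """Return exactly `num` distinct elements (sampling without replacement)."""
    rng = random.Random(seed)
    items = list(items)
    return rng.sample(items, min(num, len(items)))


def take(partitions, num):
    """Collect `num` elements, scanning the partitions one at a time."""
    result = []
    for partition in partitions:
        result.extend(islice(partition, num - len(result)))
        if len(result) >= num:
            break  # later partitions are never touched
    return result


# Hypothetical partition layout for the eight customer IDs.
partitions = [["80", "20"], ["98", "15"], ["16", "31"], ["94", "77"]]
all_ids = [x for p in partitions for x in p]

print(len(take_sample(all_ids, 5, seed=1)))  # 5
print(take(partitions, 3))  # ['80', '20', '98']
```

take stops as soon as it has enough elements, which is why it always returns the leading elements, while take_sample draws from the whole collection.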


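RDD operators fall into two kinds: transformations and actions.

A transformation creates a new RDD.

An action triggers the actual computation on an RDD and returns a result.

Lazy evaluation of transformations and actions in Spark

Lazy evaluation is an important concept in Spark. At first I simply accepted it without really understanding what "lazy" meant; the idea is as follows.

Lazy evaluation means that Spark does not actually run the computation of a transformation until an action is called.
When an action is called on an RDD, Spark examines that RDD's lineage, builds the graph of operations that must run, and then computes the action.
In short, transformations are a kind of blueprint that tells Spark which operations to run, and in what order, once an action is called.

While following the book's examples, I seem to have stumbled on an example of lazy evaluation.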

scala> val lines = sc.textFile("/home/morris01/study/spark/data/client-ids.log")
lines: org.apache.spark.rdd.RDD[String] = /home/morris01/study/spark/data/client-ids.log MapPartitionsRDD[4] at textFile at <console>:24

scala> val idsStr = lines.map(line=>line.split(","))
idsStr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:26

scala> idsStr.foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/morris01/study/spark/data/client-ids.log
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)

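The run above failed because the file path was wrong.

Yet everything appeared to go smoothly until idsStr.foreach was executed. It seems that only when the foreach action ran did Spark walk back through the RDD's lineage, try to read the file, and raise the error.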
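The same behaviour can be reproduced with Python generators, which are also evaluated lazily. This is a minimal sketch, not Spark code: text_file is a hypothetical helper and the path is deliberately one that does not exist.

```python
def text_file(path):
    """Lazily yield lines; the file is not opened until iteration starts."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


# Building the pipeline succeeds even though the path does not exist,
# because nothing has been read yet (the "transformation" stage).
lines = text_file("/no/such/dir/client-ids.log")
ids = (line.split(",") for line in lines)

# Consuming the pipeline (the "action" stage) finally touches the file.
error_seen = False
try:
    for row in ids:
        print(row)
except FileNotFoundError as err:
    error_seen = True
    print("failed only when consumed:", err.filename)
```

As in the Spark session, the bad path goes unnoticed while the pipeline is being described and surfaces only when a result is requested.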

RDD operators

The map transformation applies a function to each element of the source RDD and creates a new RDD from the results.

scala> val numbers = sc.parallelize(10 to 50 by 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> numbers.foreach(x=>println(x))
10
20
30
40
50

scala> val numberSquared = numbers.map(num=>num*num)
numberSquared: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at map at <console>:26

scala> numberSquared.foreach(x=>println(x))
100
400
900
1600
2500

scala> numberSquared.foreach(println)
100
400
900
1600
2500

The distinct and flatMap operators

The example data is a log file containing the IDs of customers who made purchases.

echo "15,16,20,20
77,80,94
94,98,16,31
31,15,20" > ~/client-ids.log

scala> val lines = sc.textFile("/home/morris01/study/spark/data/sparkinaction/client-ids.log")
lines: org.apache.spark.rdd.RDD[String] = /home/morris01/study/spark/data/sparkinaction/client-ids.log MapPartitionsRDD[7] at textFile at <console>:24

scala> val idsStr = lines.map(line=>line.split(","))
idsStr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[8] at map at <console>:26

scala> idsStr.foreach(println)
[Ljava.lang.String;@77278a7d
[Ljava.lang.String;@6876c229
[Ljava.lang.String;@25f5ac40
[Ljava.lang.String;@2d06d673
/*
I expected idsStr to be an RDD of individual strings,
but it is an RDD of string arrays.
*/
scala> idsStr.first
res5: Array[String] = Array(15, 16, 20, 20)

scala> idsStr.collect
res6: Array[Array[String]] = Array(Array(15, 16, 20, 20), Array(77, 80, 94), Array(94, 98, 16, 31), Array(31, 15, 20))
/*
collect creates a new array, gathers all of the RDD's elements into it, and returns it.
*/

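To flatten these arrays into a single collection, use flatMap.

Like map, flatMap applies a function to every element of the RDD.

The difference from map is that flatMap removes one level of nesting from the arrays returned by the anonymous function and merges all of their elements into a single collection.

It is worth knowing Scala's TraversableOnce here,

because flatMap's signature is declared as follows.

def flatMap[U](f: (T) => TraversableOnce[U]): RDD[U]

Replacing map with flatMap in the earlier computation returns the values as one flat array.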

scala> val idsStr = lines.flatMap(line=>line.split(","))
idsStr: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> idsStr.collect
res1: Array[String] = Array(15, 16, 20, 20, 77, 80, 94, 94, 98, 16, 31, 31, 15, 20)
/*
To convert the String values to Int, map each element with _.toInt.
*/
scala> val idsInt = idsStr.map(_.toInt)
idsInt: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:28

scala> idsInt.collect
res2: Array[Int] = Array(15, 16, 20, 20, 77, 80, 94, 94, 98, 16, 31, 31, 15, 20)
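The difference between map and flatMap can be checked on ordinary lists. This is a plain-Python sketch that uses the four log lines from the example file; itertools.chain.from_iterable plays the role of the flattening step and no Spark API is involved.

```python
from itertools import chain

lines = ["15,16,20,20", "77,80,94", "94,98,16,31", "31,15,20"]

# map: one output element per input element, so the result stays nested.
mapped = [line.split(",") for line in lines]
print(mapped[0])  # ['15', '16', '20', '20']

# flatMap: apply the same function, then remove one level of nesting.
flat = list(chain.from_iterable(line.split(",") for line in lines))
print(flat[:4])  # ['15', '16', '20', '20']
print(len(mapped), len(flat))  # 4 14

# Converting each string to an int mirrors idsStr.map(_.toInt).
ids_int = [int(s) for s in flat]
```

The nested result has one entry per line, while the flattened result has one entry per ID.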

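Distinct

The customer IDs are now in a single array, which is easy to work with, but to count the customers who made purchases the duplicates must be removed.

The values could be put into a Scala Set, but it is simpler to use distinct.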

scala> val uniqueIds = idsInt.distinct
uniqueIds: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at distinct at <console>:30

scala> uniqueIds.collect
res4: Array[Int] = Array(15, 77, 16, 80, 98, 20, 31, 94)

scala> val finalCount = uniqueIds.count
finalCount: Long = 8
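For comparison, the same de-duplication on a local list looks like this in plain Python. It is a sketch, not Spark code: dict.fromkeys keeps the first occurrence of each value in input order, whereas the distinct output above comes back in a different order.

```python
ids_int = [15, 16, 20, 20, 77, 80, 94, 94, 98, 16, 31, 31, 15, 20]

# dict.fromkeys keeps the first occurrence of each value, in order.
unique_ids = list(dict.fromkeys(ids_int))
print(unique_ids)  # [15, 16, 20, 77, 80, 94, 98, 31]
print(len(unique_ids))  # 8
```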

Example code on GitHub: https://github.com/spark-in-action/first-edition/blob/master/ch02/scala/ch02-listings.scala


SPARK(3) SparkContext

Last time, we wrote the following code to run Spark:

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local[1]")
      .set("spark.executor.memory", "4g")
      .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

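Here is what each part of that code does.

SparkConf holds the configuration used to create a SparkContext.

setAppName() sets the name of the Spark application.

setMaster() is set to local so that Spark runs on the local machine; the N in local[N] is the number of worker threads (cores) to use.

To use all of the local machine's cores, write local[*].

The SparkContext object manages the execution of Spark jobs on the cluster.

SparkContext offers many useful methods; the ones used most often are those that create resilient distributed datasets (RDDs).

An RDD is split into partitions that are distributed across the nodes of the cluster, and each partition holds part of the RDD's data.

A partition is the unit of parallel processing in Spark.

The simplest way to create an RDD is to call SparkContext's parallelize method with a local collection of objects as its argument.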

val rdd = sc.parallelize(Array(1,2,3,4),4)

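The first argument is the collection of objects to parallelize, and the second is the number of partitions.

When it performs a computation on the objects in a partition, Spark fetches the corresponding part of the collection from the driver process.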
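To picture what splitting a collection into partitions means, here is a plain-Python model. It is only a sketch: split_into_partitions is a hypothetical helper, and the slice boundaries Spark actually chooses may differ.

```python
def split_into_partitions(items, num_partitions):
    """Cut a local collection into `num_partitions` contiguous slices.

    A plain-Python model for illustration; real Spark may place the
    slice boundaries differently.
    """
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


print(split_into_partitions([1, 2, 3, 4], 4))  # [[1], [2], [3], [4]]
print(split_into_partitions(range(10), 3))     # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each inner list stands for one partition, the unit that a single task would process.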

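To create an RDD from a text file, or from a directory of text files, on HDFS or another file system, use the textFile method:

val rdd2 = sc.textFile("hdfs:///hadoopData/process01.txt")

If a directory name is passed to textFile, Spark treats every file in that directory as part of the RDD. Note that calling parallelize or textFile does not read any data or load anything into memory at that point. Only when it is time to compute on the objects in a partition does Spark read the input one section (split) at a time and apply the operations defined on the RDD, such as filtering and aggregation.

If you spot any errors or have questions, please share them.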


SPARK (2) Adding Spark

SPARK (2) - Spark setup and Scala traits

Adding the Spark Gradle dependencies

    compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.1'
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.1.0'
    compile group: 'org.apache.spark', name: 'spark-mllib_2.11', version: '2.1.0'

Right-click the build.gradle file and run Gradle refresh.

Then write the following main function:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext

object MainScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HelloWorld").setMaster("local[1]")
      .set("spark.executor.memory", "4g")
      .set("spark.driver.memory", "4g")
    val sc = new SparkContext(conf)

    println("=========================")
    println("Hello Spark")
    println("=========================")
    sc.stop()
  }
}

Run it as a Scala Application and the setup is complete.

Programming with RDDs

graph LR
A[RDD operations]-->B[transformation]
A[RDD operations]-->C[action]

For example, in Python:

lines = sc.textFile("README.md")

Transformation: creates a new RDD from an existing RDD. (The example is in Python.)

For example, a filter keeps only the data that matches a predicate:

pythonLines = lines.filter(lambda line: "Python" in line)

Action: computes a result from an RDD and either returns it to the driver program or saves it to external storage (e.g. HDFS).

One action we have already used is first, which returns the first element.

pythonLines.first()
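The split between a transformation and an action can be tried without Spark, because Python's built-in filter is lazy as well. This is a sketch with made-up sample lines standing in for README.md; the variable names are illustrative only.

```python
# Hypothetical stand-in for the contents of README.md.
lines = [
    "# Apache Spark",
    "Spark supports Scala, Java, Python and R.",
    "Run the Python examples with spark-submit.",
]

# "Transformation": describes a new dataset; filter() is lazy in Python 3.
python_lines = filter(lambda line: "Python" in line, lines)

# "Action": asks for a concrete value, which forces the filter to run.
first_match = next(python_lines)
print(first_match)  # Spark supports Scala, Java, Python and R.
```

Only the next call produces a value; until then python_lines is just a description of the work to do.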

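Next time I will post about basic RDD manipulation using the Spark shell.

SCALA Trait

Since I use Spark from Scala, I will also keep notes on Scala.

If any of this is wrong, please let me know in the comments.

Scala trait mixins

Mixins? I still find the concept confusing.

Let's start with a hands-on example.

The mathFunction trait declares that sumTest takes an Int and must return an Int.

The mathLecture class extends mathFunction and implements its methods.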

//mathFunction.scala

trait mathFunction {
  def sumTest(value:Int):Int
  def averageTest(values:Array[Int]):Int
}

class mathLecture extends mathFunction{
  
   override def sumTest(value:Int):Int = {
      3+value
   }
   
   override def averageTest(values:Array[Int]):Int={
     var sum = 0;
     for(i<-(0 to values.length-1)){
       sum += values(i)
     }
     sum/values.length
   }
   
}

The code below declares a value, mathClass, whose type is the mathFunction trait.

Because mathClass is assigned an instance of mathLecture, calling sumTest through the mathFunction type, which only declares the method, runs the implementation defined in mathLecture.

//MainScala.scala
object MainScala {
  def main(arg: Array[String]) {
    val mathClass:mathFunction = new mathLecture()

    println("value +3 :  "+mathClass.sumTest(4))
    println("average : "+mathClass.averageTest(Array(1,2,3,4,5)))
  }
}

Output

value +3 :  7
average : 3

Reference

https://docs.scala-lang.org/ko/tutorials/tour/traits.html.html

Useful functions for sequential collections

zipWithIndex

val days = Array("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")

days.zipWithIndex.foreach{
    case(day,count) => println(s"$count is $day")
}

Output

0 is Sunday
1 is Monday
2 is Tuesday
3 is Wednesday
4 is Thursday
5 is Friday
6 is Saturday

zip returns a single list of pairs built from the elements of two lists.
```
List(1, 2, 3).zip(List("a", "b", "c"))
```

Output

List[(Int, String)] = List((1,a), (2,b), (3,c))
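For readers coming from Python, the rough counterparts of these two functions are enumerate and zip. This is a short sketch for comparison only; note that enumerate yields (index, element) while Scala's zipWithIndex yields (element, index).

```python
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# enumerate pairs each element with its index, much like zipWithIndex.
for count, day in enumerate(days):
    print(f"{count} is {day}")

# zip pairs up the elements of two lists position by position.
pairs = list(zip([1, 2, 3], ["a", "b", "c"]))
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```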


SPARK_ENV(1)

Studying Spark (1): Setting up the environment

Installing Spark

Setting up a Scala project for Spark in Eclipse using Gradle.

1. Creating a Scala Gradle project in Eclipse

Plugins to install in Eclipse beforehand:

Scala plugin
Gradle

File -> New -> Other -> Gradle

Click Next through the wizard, enter the project name, keep clicking Next, and then click Finish.

This creates the new project.

Opening build.gradle shows the initial generated code.

I was curious what the generated code meant, so I looked it up.

repositories -> jcenter

 
repositories{
    jcenter()
}

What is jcenter?

jcenter is a public repository hosted on Bintray that is free for publishers of open-source libraries. There are several reasons to use jcenter over Maven Central. Here are some of the main ones.

jcenter serves libraries through a CDN, which means faster CI and developer builds.
jcenter is the largest Java repository on earth. In other words, whatever is available on Maven Central is available on jcenter as well.
Uploading your own library to Bintray is very easy. There is no need to sign artifacts or go through complicated steps as with Maven Central.
Friendly UI. If you want to upload your library to Maven Central, you can do it with a single click on the Bintray site.

Reference: http://code.i-harness.com/ko/q/17f906f

apply plugin: 'PluginName'
: applies 'PluginName' as a Gradle plugin
dependencies: the section that declares dependencies on the external libraries the project uses
If the dependencies come from jcenter, add jcenter() to repositories; if they come from the Maven Central repository, add mavenCentral().

I use the Maven repository more often than jcenter, so I added the Maven repository.

 
/*
 * This build file was generated by the Gradle 'init' task.
 *
 * This generated file contains a sample Java Library project to get you started.
 * For more details take a look at the Java Libraries chapter in the Gradle
 * user guide available at https://docs.gradle.org/4.3/userguide/java_library_plugin.html
 */
// Apply the java-library plugin to add support for Java Library
apply plugin: 'java-library'
apply plugin: 'scala'
apply plugin: 'eclipse'
// In this section you declare where to find the dependencies of your project
repositories {
    // Use Maven Central and the local Maven repository for resolving your dependencies.
    // You can declare any Maven/Ivy/file repository here.
    mavenCentral()
    mavenLocal()
    
}
dependencies {
    // This dependency is exported to consumers, that is to say found on their compile classpath.
    compile group: 'org.apache.commons', name: 'commons-math3', version: '3.6.1'
    // This dependency is used internally, and not exposed to consumers on their own compile classpath.
    compile group: 'com.google.guava', name: 'guava', version: '23.0'
    // Use JUnit test framework
    testCompile group: 'junit', name: 'junit', version: '4.12'
    
    //input dependencies
    compile group: 'org.slf4j', name: 'slf4j-api', version: '1.7.5'
    compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.2'
}

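Next, right-click the project directory and refresh the Gradle project so that the updated build.gradle takes effect.

The setup is now almost done and it is time to write code. Create a src/main/scala directory under the project root.

Then create a file named Main.scala.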

Main.scala

 
object Main {
  def main(args:Array[String]){
    print("Complete Gradle Scala Project");
  }
}

After creating the file, run the program and you can see the print statement being executed.

References:

http://techs.studyhorror.com/gradle-scala-eclipse-project-i-181

https://medium.com/@goinhacker/%EC%9A%B4%EC%98%81-%EC%9E%90%EB%8F%99%ED%99%94-1-%EB%B9%8C%EB%93%9C-%EC%9E%90%EB%8F%99%ED%99%94-by-gradle-7630c0993d09


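While researching complex event processing (CEP), I found that there are many open-source engines, such as Esper and the Siddhi-based WSO2.

For processing data streams, however, the leading open-source frameworks are Spark and Storm.

Spark is said to support memory-based distributed computing; I will write it up separately when I have time.

Storm has been open source for a long time and is said to be stable.

Apache Storm homepage: http://storm.apache.org/index.html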

Why use Storm?

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed. Read more in the tutorial.

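Storm is used by many companies, including Twitter, Yahoo, Yelp, Flipboard, and Groupon.

Basic building blocks of Storm

Storm is built from topologies, streams, spouts, and bolts.

Topology

A topology is Storm's distributed computation structure, made up of streams, spouts, and bolts.

It defines how the spouts and bolts are connected, and therefore how data flows between them.

: A topology is much like a job in a batch system such as Hadoop, but a batch job has a clearly defined start and end, whereas a Storm topology keeps running until it is killed or undeployed.

Stream

A tuple is Storm's basic data structure, a list of key-value pairs; a stream is defined as a continuous sequence of tuples.

A Storm tuple can be thought of as an event in CEP.

Spout

A spout is the data source that reads data in (the entry point into a Storm topology). It also creates the tuples that represent the data; a tuple is the unit in which data is carried.

A spout works as an adapter: it connects to the data source, converts the data into tuples, and emits those tuples as a stream.

Bolt

A bolt is a function that processes the data that was read in; in CEP terms it corresponds to an operator or a real-time computation. It takes a data stream as input and, depending on its logic, either passes its output on to another bolt or ends the processing there.

In the simplest case, a topology consists of one spout and several bolts.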
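The roles of spout, bolt, and topology can be modelled with Python generators. This is a conceptual sketch only and does not use Storm's API. The function names and the word-count example are assumptions chosen for illustration, and the input is finite, unlike a real topology, which runs until it is killed.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: read from a data source and emit one tuple per record."""
    for sentence in sentences:
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consume a stream and emit one tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Terminal bolt: aggregate the stream instead of passing it on."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
    return counts


# The "topology" is just the wiring: spout -> split bolt -> count bolt.
source = ["storm is fast", "storm is scalable"]
counts = count_bolt(split_bolt(sentence_spout(source)))
print(counts["storm"], counts["fast"])  # 2 1
```

Each tuple is a small dictionary of key-value pairs, which matches the description of tuples above.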


License: Attribution


Setting up distributed MariaDB

http://mytalkhome.tistory.com/840

Spider On MariaDB

http://yakolla.tistory.com/69

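Spider: extends the existing table partitioning feature so that data can be stored and read remotely, which makes sharding possible without changing SQL statements or the database server's environment or structure.

Tables on different MySQL instances can be handled as if they were tables on the same instance.
Because it supports transactions, including XA transactions, it can be used for clustering databases that take writes.
Because it supports table partitioning, partition rules can be used to distribute the data of a single table across multiple servers.
When a table is created with the Spider storage engine, MySQL internally creates a table link to the table on the remote server, much like a symbolic link to a file.
There is no restriction on the storage engine of the linked table.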
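The idea of a partition rule spreading one table's rows over several servers can be sketched as follows. This is a conceptual plain-Python model, not Spider's implementation: the server names, the route and distribute helpers, and the choice of a CRC32 hash rule are all assumptions made for illustration.

```python
import zlib

# Hypothetical backend servers; in Spider these would hold the remote tables.
SERVERS = ["db-node-1", "db-node-2", "db-node-3"]


def route(key, servers=SERVERS):
    """Pick the server that owns `key` using a hash partition rule."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return servers[zlib.crc32(str(key).encode("utf-8")) % len(servers)]


def distribute(rows, key_field):
    """Group rows by the server their partition key maps to."""
    placement = {server: [] for server in SERVERS}
    for row in rows:
        placement[route(row[key_field])].append(row)
    return placement


rows = [{"id": i, "name": f"user{i}"} for i in range(6)]
for server, owned in distribute(rows, "id").items():
    print(server, [r["id"] for r in owned])
```

The caller only ever sees one logical table; the rule decides which server stores each row.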

MariaDB + Galera Cluster
- https://mariadb.com/kb/en/mariadb/getting-started-with-mariadb-galera-cluster/
- http://codesanctum.net/mariadb-galera-cluster-%EC%84%A4%EC%B9%98/
- http://dev.dwuthk.com/entry/MariaDB-Galera-Cluster-%EC%84%A4%EC%B9%98-%EB%B0%8F-%EC%84%A4%EC%A0%95
An explanation of Galera Cluster:
http://bcho.tistory.com/1062

A comprehensive slide deck on installing MariaDB:
http://www.slideshare.net/junghaelee10/mariadb-58514643
