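# Elasticsearch Monitoring and Failure-Detection Metrics

## Overview

This document summarizes the APIs and metrics used to monitor an Elasticsearch cluster and to decide when it is in a failure state.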

---

## 1. Cluster Health (Most Important)

### API

```bash
GET /_cluster/health
```

### Example Response

```json
{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 50,
  "active_shards": 100,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
```

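### Failure Criteria

| Status | Meaning | Failure? | Action |
|--------|---------|----------|--------|
| **green** | All primary and replica shards are assigned | Normal | - |
| **yellow** | All primaries are assigned; one or more replicas are not | Warning | Check replica shards |
| **red** | One or more primary shards are unassigned | **Failure** | Act immediately |

### Key Fields to Check

| Field | Failure criterion |
|-------|-------------------|
| `status` | `red` |
| `unassigned_shards` | `> 0` |
| `number_of_pending_tasks` | Increasing steadily |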
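
The three field checks above can be sketched as a small function over parsed `GET /_cluster/health` responses. This is a minimal sketch, not a definitive implementation: the function name is hypothetical, and "increasing steadily" is assumed here to mean strictly rising across at least three consecutive samples.

```python
from typing import Any, Dict, List


def cluster_health_findings(samples: List[Dict[str, Any]]) -> List[str]:
    """Return failure findings for consecutive /_cluster/health responses.

    `samples` is ordered oldest to newest. The newest sample is checked for
    `status` and `unassigned_shards`; the whole series is checked for
    pending-task growth.
    """
    if not samples:
        return []
    latest = samples[-1]
    findings: List[str] = []

    if latest.get("status") == "red":
        findings.append("status is red")

    unassigned = latest.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shard(s)")

    pending = [s.get("number_of_pending_tasks", 0) for s in samples]
    # Assumption: "steadily" = strictly rising over at least three samples.
    if len(pending) >= 3 and all(a < b for a, b in zip(pending, pending[1:])):
        findings.append("pending tasks increasing steadily")

    return findings
```

An empty list means none of the three criteria fired; a single sample is enough for the `status` and `unassigned_shards` checks.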

---

## 2. Node Status

### API

```bash
# Node list and resource usage
GET /_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role,master

# Detailed node statistics
GET /_nodes/stats
```

### Failure Criteria

| Metric | Normal | Warning | Failure |
|--------|--------|---------|---------|
| Node count | Matches the expected count | - | Fewer than expected |
| `heap.percent` | < 75% | 75-85% | > 85% |
| `cpu` | < 70% | 70-90% | > 90% sustained |
| `load_1m` | < core count | Near core count | > 2 × core count |
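
As a rough illustration, the thresholds in this table can be applied to one row of `GET /_cat/nodes?format=json`, where every value arrives as a string. The function names are hypothetical, the core count must be supplied by the caller, and treating any load at or above the core count as a warning is an assumption, since the table leaves that range open. A single sample also cannot show that CPU usage is sustained.

```python
from typing import Dict, Iterable, List


def _level(value: float, warn: float, fail: float) -> str:
    """Return 'failure' above `fail`, 'warning' at or above `warn`, else 'normal'."""
    if value > fail:
        return "failure"
    if value >= warn:
        return "warning"
    return "normal"


def classify_node(row: Dict[str, str], cores: int) -> Dict[str, str]:
    """Classify one _cat/nodes row using the thresholds in the table above."""
    return {
        "heap.percent": _level(float(row["heap.percent"]), 75, 85),
        "cpu": _level(float(row["cpu"]), 70, 90),
        # Assumption: load between 1x and 2x the core count counts as a warning.
        "load_1m": _level(float(row["load_1m"]), cores, 2 * cores),
    }


def missing_nodes(rows: Iterable[Dict[str, str]], expected: Iterable[str]) -> List[str]:
    """Return the expected node names that are absent from the _cat/nodes rows."""
    present = {row["name"] for row in rows}
    return sorted(set(expected) - present)
```

Any non-empty result from `missing_nodes` corresponds to the "fewer than expected" failure condition.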

---

## 3. Thread Pool (Key Metric)

### API

```bash
# Status of the write and search thread pools
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected

# All thread pools
GET /_cat/thread_pool?v
```

### Example Response

```
node_name  name   active queue rejected
node-1     write  5      0     0
node-1     search 10     0     0
node-2     write  3      0     0
node-2     search 8      0     0
```

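### Failure Criteria

| Metric | Normal | Warning | Failure |
|--------|--------|---------|---------|
| `rejected` (cumulative since node start) | 0 or not increasing | - | **Increasing (requests are being rejected)** |
| `queue` | 0 | Trending upward | Sustained growth |
| `active` | - | Near the pool maximum | Pinned at the maximum |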
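
Because `rejected` is a counter that accumulates from node start, an alert is more useful when it compares two samples than when it tests the raw value. The sketch below assumes rows fetched with `GET /_cat/thread_pool?format=json&h=node_name,name,rejected`; the function name is hypothetical, and a counter that goes backwards is assumed to mean the node restarted.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
PoolKey = Tuple[str, str]  # (node_name, thread pool name)


def rejected_deltas(previous: List[Row], current: List[Row]) -> Dict[PoolKey, int]:
    """Return new rejections per (node, pool) between two _cat/thread_pool samples."""
    before = {(r["node_name"], r["name"]): int(r["rejected"]) for r in previous}
    deltas: Dict[PoolKey, int] = {}
    for row in current:
        key = (row["node_name"], row["name"])
        if key not in before:
            # No baseline for this node/pool yet, so nothing to compare against.
            continue
        now = int(row["rejected"])
        base = before[key]
        # Assumption: a lower counter means the node restarted and reset to 0.
        delta = now if now < base else now - base
        if delta > 0:
            deltas[key] = delta
    return deltas
```

An empty result means no pool rejected work during the interval, even if the cumulative counters are non-zero.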

### Thread Pool Types

| Name | Purpose |
|------|---------|
| `write` | Index, update, delete, and bulk requests |
| `search` | Search requests |
| `get` | Real-time get requests |
| `analyze` | Analyze requests |

---

## 4. JVM Memory

### API

```bash
# JVM memory statistics
GET /_nodes/stats/jvm

# Quick view of heap usage
GET /_cat/nodes?v&h=name,heap.percent,heap.current,heap.max
```

### Failure Criteria

| Metric | Normal | Warning | Failure |
|--------|--------|---------|---------|
| `heap_used_percent` | < 75% | 75-85% | > 85% |
| Old GC frequency | Low | Increasing | Frequent (stop-the-world pauses) |
| Old GC duration | < 1 s | 1-5 s | > 5 s |

### Checking GC Statistics

```bash
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
```
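
The GC counters in this response (`collection_count` and `collection_time_in_millis` under `jvm.gc.collectors.old`) are cumulative, so a duration has to be derived from two samples. The following is a minimal sketch with hypothetical function names. It assumes the "Old GC duration" thresholds refer to the average time per old-generation collection between the two samples, which the table does not state explicitly.

```python
from typing import Any, Dict


def old_gc_average_seconds(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, float]:
    """Average old-generation GC duration per node between two _nodes/stats/jvm samples."""
    averages: Dict[str, float] = {}
    prev_nodes = previous.get("nodes", {})
    for node_id, node in current.get("nodes", {}).items():
        if node_id not in prev_nodes:
            continue  # node joined between samples; no baseline
        new = node["jvm"]["gc"]["collectors"]["old"]
        old = prev_nodes[node_id]["jvm"]["gc"]["collectors"]["old"]
        runs = new["collection_count"] - old["collection_count"]
        millis = new["collection_time_in_millis"] - old["collection_time_in_millis"]
        if runs > 0:
            averages[node_id] = millis / runs / 1000.0
    return averages


def classify_old_gc(seconds: float) -> str:
    """Apply the < 1 s / 1-5 s / > 5 s thresholds from the table above."""
    if seconds > 5:
        return "failure"
    if seconds >= 1:
        return "warning"
    return "normal"
```

Nodes with no old-generation collections in the interval are omitted from the result rather than reported as zero.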

---

## 5. Disk Usage

### API

```bash
# Disk usage per node
GET /_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total

# Disk usage per shard
GET /_cat/shards?v&h=index,shard,store
```

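### Failure Criteria (Watermarks)

| Disk usage (default watermarks) | Status | Elasticsearch behavior |
|---------------------------------|--------|------------------------|
| < 85% | Normal | - |
| **85%** (low) | Warning | Stops allocating new shards to the node |
| **90%** (high) | Danger | Starts relocating shards away from the node |
| **95%** (flood stage) | **Failure** | Blocks writes to every index with a shard on the node (`index.blocks.read_only_allow_delete`) |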
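
The watermark table maps directly onto a threshold function. The sketch below uses the 85/90/95 defaults from the table as parameter defaults; the function names are hypothetical, and clusters that override the watermarks would need to pass their own values. Rows are assumed to come from `GET /_cat/allocation?format=json`, where `disk.percent` is a string, or null for the `UNASSIGNED` row.

```python
from typing import Dict, List, Optional


def classify_disk(percent: float, low: float = 85.0, high: float = 90.0, flood: float = 95.0) -> str:
    """Map a disk usage percentage onto the watermark states in the table above."""
    if percent >= flood:
        return "failure"
    if percent >= high:
        return "danger"
    if percent >= low:
        return "warning"
    return "normal"


def classify_allocation(rows: List[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Classify every node row from _cat/allocation, skipping rows without disk data."""
    result: Dict[str, str] = {}
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # e.g. the UNASSIGNED pseudo-row has no disk figures
        result[str(row["node"])] = classify_disk(float(percent))
    return result
```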

### Checking Watermark Settings

```bash
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk
```

---

## 6. Shard Status

### API

```bash
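# Check shard states
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason

# Sort by state so that UNASSIGNED shards are grouped together
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

# Explain shard allocation (with no request body, explains an arbitrary unassigned shard)
GET /_cluster/allocation/explain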
```

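### Shard States

| State | Meaning | Action |
|-------|---------|--------|
| `STARTED` | Operating normally | - |
| `RELOCATING` | Moving to another node | Wait |
| `INITIALIZING` | Initializing | Wait |
| `UNASSIGNED` | **Not assigned to any node** | Find the cause |

### Reasons for UNASSIGNED

| Reason | Description |
|--------|-------------|
| `INDEX_CREATED` | The index was just created |
| `CLUSTER_RECOVERED` | The cluster is recovering |
| `NODE_LEFT` | The node hosting the shard left the cluster |
| `ALLOCATION_FAILED` | Shard allocation failed |
| `NO_VALID_SHARD_COPY` | No valid shard copy exists (the allocation explain API reports this as `can_allocate: no_valid_shard_copy`) |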
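
When many shards are unassigned, grouping them by reason shows quickly whether one cause dominates. The sketch below assumes rows from `GET /_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`; the function names and the `UNKNOWN` placeholder for a missing reason are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

ShardRow = Dict[str, Optional[str]]


def unassigned_by_reason(rows: List[ShardRow]) -> Counter:
    """Count UNASSIGNED shards per `unassigned.reason` value."""
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in rows
        if row.get("state") == "UNASSIGNED"
    )


def unassigned_primaries(rows: List[ShardRow]) -> List[Tuple[str, int]]:
    """List (index, shard) pairs whose primary copy is unassigned."""
    return sorted(
        (str(row["index"]), int(str(row["shard"])))
        for row in rows
        if row.get("state") == "UNASSIGNED" and row.get("prirep") == "p"
    )
```

A non-empty result from `unassigned_primaries` corresponds to a `red` cluster status, so those shards are the ones to pass to the allocation explain API first.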

---

## 7. Indexing/Search Performance

### API

```bash
# Indexing and search statistics
GET /_stats/indexing,search

# A specific index
GET /my-index/_stats/indexing,search
```

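### Key Metrics

| Metric | Description | Failure criterion |
|--------|-------------|-------------------|
| `indexing.index_total` | Total indexing operations (cumulative) | - |
| `indexing.index_failed` | Failed indexing operations (cumulative) | **Increasing** |
| `indexing.index_time_in_millis` | Cumulative time spent indexing | Watch for a sharp increase |
| `search.query_total` | Total queries (cumulative) | - |
| `search.query_time_in_millis` | Cumulative time spent on queries | Sharp increase over the usual baseline |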
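
All of these fields are cumulative totals, so per-operation latency and new failures have to be computed from the difference between two samples. The following sketch assumes two parsed `GET /_stats/indexing,search` responses and reads the `_all.total` section; the function name and the keys of the returned dictionary are hypothetical.

```python
from typing import Any, Dict, Optional


def interval_metrics(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Derive per-interval failure count and average latencies from two _stats samples."""
    prev_total = previous["_all"]["total"]
    cur_total = current["_all"]["total"]

    def delta(section: str, field: str) -> int:
        return cur_total[section][field] - prev_total[section][field]

    def average(time_delta: int, count_delta: int) -> Optional[float]:
        # No operations in the interval means there is no latency to report.
        return time_delta / count_delta if count_delta > 0 else None

    return {
        "index_failed_delta": delta("indexing", "index_failed"),
        "avg_index_ms": average(
            delta("indexing", "index_time_in_millis"), delta("indexing", "index_total")
        ),
        "avg_query_ms": average(
            delta("search", "query_time_in_millis"), delta("search", "query_total")
        ),
    }
```

Comparing `avg_index_ms` and `avg_query_ms` against a stored baseline is one way to detect the "sharp increase" the table refers to; what counts as sharp is left to the operator.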

---

## 8. Pending Tasks

### API

```bash
# Cluster-level tasks waiting in the queue
GET /_cluster/pending_tasks

# Currently running tasks
GET /_tasks
```

### Failure Criteria

| Metric | Normal | Warning | Failure |
|--------|--------|---------|---------|
| Pending task count | 0 | Temporary increase | Sustained increase |
| `time_in_queue_millis` | < 1000 | 1000-5000 | > 5000 |
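
The `time_in_queue_millis` thresholds can be checked against a parsed `GET /_cluster/pending_tasks` response, which lists the queued tasks under `tasks`. This is a minimal sketch with a hypothetical function name; it assumes the longest-waiting task determines the severity, which the table does not specify.

```python
from typing import Any, Dict


def classify_pending_tasks(response: Dict[str, Any]) -> str:
    """Classify a /_cluster/pending_tasks response by its longest queue wait."""
    tasks = response.get("tasks", [])
    if not tasks:
        return "normal"
    # Assumption: the task that has waited longest decides the severity.
    longest = max(task["time_in_queue_millis"] for task in tasks)
    if longest > 5000:
        return "failure"
    if longest >= 1000:
        return "warning"
    return "normal"
```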

---

## Failure Checklist (by Priority)

| Rank | Metric | API | Failure criterion | Severity |
|------|--------|-----|-------------------|----------|
| 1 | Cluster health | `/_cluster/health` | `status: red` | Critical |
| 2 | Thread pool rejections | `/_cat/thread_pool/write?v` | `rejected` increasing | Critical |
| 3 | Heap memory | `/_cat/nodes?h=heap.percent` | `> 85%` | High |
| 4 | Disk usage | `/_cat/allocation?v` | `> 90%` | High |
| 5 | Unassigned shards | `/_cluster/health` | `unassigned_shards > 0` | High |
| 6 | Node count | `/_cat/nodes` | Fewer than expected | High |
| 7 | Indexing failures | `/_stats/indexing` | `index_failed` increasing | Medium |
| 8 | Pending tasks | `/_cluster/pending_tasks` | Sustained increase | Medium |

---

## Recommended Polling Intervals

| Metric | Interval |
|--------|----------|
| Cluster health | 10 s |
| Thread pool rejections | 10 s |
| Node status | 30 s |
| Heap memory | 30 s |
| Disk usage | 1 min |
| Indexing/search statistics | 1 min |
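
One simple way to honor these intervals from a single polling loop is to tick every 10 seconds and run whichever checks are due. The sketch below only decides which checks are due; the check names are hypothetical labels, and the loop and the HTTP calls are left out.

```python
from typing import Dict, List

# Intervals in seconds, taken from the table above. Keys are hypothetical labels.
POLL_INTERVALS_SECONDS: Dict[str, int] = {
    "cluster_health": 10,
    "thread_pool_rejected": 10,
    "node_status": 30,
    "heap_memory": 30,
    "disk_usage": 60,
    "indexing_search_stats": 60,
}


def checks_due(elapsed_seconds: int) -> List[str]:
    """Return the checks whose interval divides the elapsed time evenly."""
    return [
        name
        for name, interval in POLL_INTERVALS_SECONDS.items()
        if elapsed_seconds % interval == 0
    ]
```

With a 10-second tick, every sixth tick runs all six checks and the ticks in between run only the faster ones.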

---

## Query Collection for Kibana Dev Tools

```bash
# ========================================
# Cluster health
# ========================================

# Overall cluster health
GET /_cluster/health

# Health per index
GET /_cluster/health?level=indices

# ========================================
# Node status
# ========================================

# Node resource usage
GET /_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role,master

# ========================================
# Thread pool (check rejected)
# ========================================

# write/search thread pools
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected

# ========================================
# Disk usage
# ========================================

# Disk per node
GET /_cat/allocation?v&h=node,disk.percent,disk.avail

# ========================================
# Shard status
# ========================================

# All shards, sorted by state
GET /_cat/shards?v&s=state

# Reason a shard is UNASSIGNED
GET /_cluster/allocation/explain

# ========================================
# Performance statistics
# ========================================

# Indexing and search statistics
GET /_stats/indexing,search?filter_path=_all.total

# ========================================
# Pending tasks
# ========================================

GET /_cluster/pending_tasks
```

---

## References

- [Elasticsearch Cluster Health API](https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html)
- [Cat APIs](https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html)
- [Nodes Stats API](https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html)