Process Termination

Introduction

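The final stage of a process's life cycle is termination. When a process finishes executing or is forcibly terminated, the kernel reclaims the resources allocated to it, such as memory and file descriptors, and notifies the parent process. This article follows the Linux kernel v7.0 source code through the whole termination path, then examines why zombie and orphan processes arise and how the kernel deals with them.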

The Termination Path

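Process termination is either voluntary or involuntary. Voluntary termination occurs when a process returns from main() or explicitly calls exit() (the C library function, which ends in the exit_group() system call). Involuntary termination is when the kernel forcibly ends the process because it received a signal it cannot handle¹ or raised an unrecoverable exception. Some operating systems also terminate child processes in a chain when their parent terminates (cascading termination); Linux instead assigns a new parent to orphaned processes.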
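The two kinds of termination can be told apart from user space through the wait status. The following is a minimal sketch, not taken from the original article: run_child() and describe() are hypothetical helper names, and the child uses raise() on itself to stand in for an involuntary termination.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, or, if `sig`
 * is non-zero, first sends itself `sig`. Returns the raw wait status
 * collected by the parent. */
int run_child(int code, int sig)
{
	int status = 0;
	pid_t pid = fork();

	assert(pid >= 0);
	if (pid == 0) {
		if (sig != 0)
			raise(sig);   /* involuntary path if `sig` is fatal */
		_exit(code);          /* voluntary path */
	}
	pid_t got = waitpid(pid, &status, 0);
	assert(got == pid);
	return status;
}

/* Hypothetical helper: decode a wait status the way a parent would. */
void describe(int status)
{
	if (WIFEXITED(status))
		printf("exited normally, code %d\n", WEXITSTATUS(status));
	else if (WIFSIGNALED(status))
		printf("terminated by signal %d\n", WTERMSIG(status));
}
```

For example, describe(run_child(7, 0)) should take the WIFEXITED branch, while describe(run_child(0, SIGKILL)) should take the WIFSIGNALED branch.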

Whichever path is taken, execution eventually reaches do_exit(), defined in kernel/exit.c. This function is the kernel's core routine and does the actual work of terminating a process.

`do_exit()`

The diagram below summarizes the key steps of do_exit().

In broad strokes, do_exit() does the following.

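For a kernel thread, do_exit() first calls kthread_do_exit() to take care of kthread-specific teardown. Kernel threads have no user-space resources, so the rest of the exit path has less to clean up for them than for an ordinary process.²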
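exit_signals() sets the PF_EXITING flag in the flags field of task_struct to mark the process as exiting.
acct_update_integrals() updates the task's accumulated resource-usage statistics, such as CPU time and memory usage.³
atomic_dec_and_test(&tsk->signal->live) checks whether this task is the last live thread in its thread group. If the last thread of the global init process (PID 1) is exiting, the kernel panics.
The task's exit code is stored in the exit_code member of task_struct. The parent process can collect this value with the wait() family of system calls.
exit_mm() releases the process's mm_struct. If no other process shares the address space, its resources are freed completely.
exit_sem(), exit_shm(), exit_files(), and exit_fs() are called in that order to clean up IPC semaphores⁴, shared memory, file descriptors, and filesystem state. Each resource is returned to the system once its reference count drops to zero.
exit_notify() sends a signal to the parent process, assigns a new parent to the child processes, and sets exit_state in task_struct to EXIT_ZOMBIE.
do_task_dead() sets the process state to TASK_DEAD and hands control to the scheduler. The process will never be scheduled again, so this is the last code the exiting task executes.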

void __noreturn do_exit(long code)
{
	struct task_struct *tsk = current;
	struct kthread *kthread;
	int group_dead;

	WARN_ON(irqs_disabled());
	WARN_ON(tsk->plug);

	kthread = tsk_is_kthread(tsk);
	if (unlikely(kthread))
		kthread_do_exit(kthread, code);

	kcov_task_exit(tsk);
	kmsan_task_exit(tsk);

	synchronize_group_exit(tsk, code);
	ptrace_event(PTRACE_EVENT_EXIT, code);
	user_events_exit(tsk);

	io_uring_files_cancel();
	sched_mm_cid_exit(tsk);
	exit_signals(tsk);  /* sets PF_EXITING */

	seccomp_filter_release(tsk);

	acct_update_integrals(tsk);
	group_dead = atomic_dec_and_test(&tsk->signal->live);
	if (group_dead) {
		/*
		 * If the last thread of global init has exited, panic
		 * immediately to get a useable coredump.
		 */
		if (unlikely(is_global_init(tsk)))
			panic("Attempted to kill init! exitcode=0x%08x\n",
				tsk->signal->group_exit_code ?: (int)code);

#ifdef CONFIG_POSIX_TIMERS
		hrtimer_cancel(&tsk->signal->real_timer);
		exit_itimers(tsk);
#endif
		if (tsk->mm)
			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
	}
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	audit_free(tsk);

	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);
	trace_sched_process_exit(tsk, group_dead);

	/*
	 * Since sampling can touch ->mm, make sure to stop everything before we
	 * tear it down.
	 *
	 * Also flushes inherited counters to the parent - before the parent
	 * gets woken up by child-exit notifications.
	 */
	perf_event_exit_task(tsk);
	/*
	 * PF_EXITING (above) ensures unwind_deferred_request() will no
	 * longer add new unwinds. While exit_mm() (below) will destroy the
	 * ability to do unwinds. So flush any pending unwinds here.
	 */
	unwind_deferred_task_exit(tsk);

	exit_mm();

	if (group_dead)
		acct_process();

	exit_sem(tsk);
	exit_shm(tsk);
	exit_files(tsk);
	exit_fs(tsk);
	if (group_dead)
		disassociate_ctty(1);
	exit_nsproxy_namespaces(tsk);
	exit_task_work(tsk);
	exit_thread(tsk);

	sched_autogroup_exit_task(tsk);
	cgroup_task_exit(tsk);

	/*
	 * FIXME: do that only when needed, using sched_exit tracepoint
	 */
	flush_ptrace_hw_breakpoint(tsk);

	exit_tasks_rcu_start();
	exit_notify(tsk, group_dead);
	proc_exit_connector(tsk);
	mpol_put_task_policy(tsk);
#ifdef CONFIG_FUTEX
	if (unlikely(current->pi_state_cache))
		kfree(current->pi_state_cache);
#endif
	/*
	 * Make sure we are holding no locks:
	 */
	debug_check_no_locks_held();

	if (tsk->io_context)
		exit_io_context(tsk);

	if (tsk->splice_pipe)
		free_pipe_info(tsk->splice_pipe);

	if (tsk->task_frag.page)
		put_page(tsk->task_frag.page);

	exit_task_stack_account(tsk);

	check_stack_usage();
	preempt_disable();
	if (tsk->nr_dirtied)
		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
	exit_rcu();
	exit_tasks_rcu_finish();

	lockdep_free_task(tsk);
	do_task_dead();
}

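Once do_exit() has handed control to do_task_dead(), the task can no longer run. At this point the only memory it still occupies is its kernel stack, the thread_info structure, and the task_struct structure. The kernel keeps this minimal information so that the parent process can collect the exit status. Once the parent has checked the status with wait(), or the kernel decides the information is no longer needed, the remaining memory is released as well.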

Zombie Processes

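By the time do_exit() reaches do_task_dead(), exit_notify() has set the process's state to EXIT_ZOMBIE. A process like this, which has stopped executing but whose exit status has not yet been collected by its parent through the wait() family of system calls, is called a zombie process.⁵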

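The zombie state exists because of a design principle of the UNIX process model. A parent process is entitled to learn how its child ended: whether it exited normally or was killed by a signal, and what its exit code was. Until the parent collects this information, the kernel has to keep a minimal set of data structures (task_struct, the kernel stack, thread_info). The problem arises when the parent never calls wait(). The zombie then keeps occupying system memory for as long as the parent stays alive, and a large pile-up of zombies can lead to PID exhaustion or wasted memory.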
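A zombie can be observed directly from user space. The sketch below is an illustration under stated assumptions, not part of the original article: proc_state() and make_zombie() are hypothetical helper names, /proc is assumed to be mounted, and the child's command name is assumed not to contain a ')' character.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: return the state letter from /proc/<pid>/stat
 * ('R', 'S', 'Z', ...), or '?' if the entry cannot be read, for example
 * because the PID no longer exists. */
char proc_state(pid_t pid)
{
	char path[64], state = '?';
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return '?';
	/* Layout: pid (comm) state ...; skip the pid and the comm field. */
	if (fscanf(f, "%*d (%*[^)]) %c", &state) != 1)
		state = '?';
	fclose(f);
	return state;
}

/* Hypothetical helper: fork a child that exits at once and deliberately
 * do not wait for it. Polls for up to about a second until the child
 * shows up as a zombie, then returns its PID. */
pid_t make_zombie(void)
{
	struct timespec tick = { 0, 1000000 };   /* 1 ms */
	pid_t pid = fork();

	assert(pid >= 0);
	if (pid == 0)
		_exit(0);
	for (int i = 0; i < 1000 && proc_state(pid) != 'Z'; i++)
		nanosleep(&tick, NULL);
	return pid;
}
```

A caller would typically check proc_state() on the returned PID, then call waitpid() on it; after the wait succeeds, the /proc entry should be gone.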

When the parent calls wait(), the kernel internally calls release_task() to free the zombie's remaining resources completely. Roughly, the following happens.

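release_task() calls __exit_signal(), which calls __unhash_process(); that in turn uses detach_pid() to remove the process from the PID lookup structures (the pidhash) and from the task list.
__exit_signal() also releases the remaining resources used by the terminated process and records statistics and other bookkeeping.
If the task was the last non-leader thread of its thread group and the group leader is already a zombie, do_notify_parent() informs the group leader's parent. If that parent ignores SIGCHLD, the group leader's state is changed to EXIT_DEAD and it is cleaned up immediately.
put_task_struct_rcu_user() is called to return the pages that held the process's kernel stack and thread_info structure, and the slab cache object that held its task_struct.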

void release_task(struct task_struct *p)
{
	struct release_task_post post;
	struct task_struct *leader;
	struct pid *thread_pid;
	int zap_leader;
repeat:
	memset(&post, 0, sizeof(post));

	/* don't need to get the RCU readlock here - the process is dead and
	 * can't be modifying its own credentials. */
	dec_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);

	pidfs_exit(p);
	cgroup_task_release(p);

	/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
	thread_pid = task_pid(p);

	write_lock_irq(&tasklist_lock);
	ptrace_release_task(p);
	__exit_signal(&post, p);

	/*
	 * If we are the last non-leader member of the thread
	 * group, and the leader is zombie, then notify the
	 * group leader's parent process. (if it wants notification.)
	 */
	zap_leader = 0;
	leader = p->group_leader;
	if (leader != p && thread_group_empty(leader)
			&& leader->exit_state == EXIT_ZOMBIE) {
		/* for pidfs_exit() and do_notify_parent() */
		if (leader->signal->flags & SIGNAL_GROUP_EXIT)
			leader->exit_code = leader->signal->group_exit_code;
		/*
		 * If we were the last child thread and the leader has
		 * exited already, and the leader's parent ignores SIGCHLD,
		 * then we are the one who should release the leader.
		 */
		zap_leader = do_notify_parent(leader, leader->exit_signal);
		if (zap_leader)
			leader->exit_state = EXIT_DEAD;
	}

	write_unlock_irq(&tasklist_lock);
	/* @thread_pid can't go away until free_pids() below */
	proc_flush_pid(thread_pid);
	exit_cred_namespaces(p);
	add_device_randomness(&p->se.sum_exec_runtime,
			      sizeof(p->se.sum_exec_runtime));
	free_pids(post.pids);
	release_thread(p);
	/*
	 * This task was already removed from the process/thread/pid lists
	 * and lock_task_sighand(p) can't succeed. Nobody else can touch
	 * ->pending or, if group dead, signal->shared_pending. We can call
	 * flush_sigqueue() lockless.
	 */
	flush_sigqueue(&p->pending);
	if (thread_group_leader(p))
		flush_sigqueue(&p->signal->shared_pending);

	put_task_struct_rcu_user(p);

	p = leader;
	if (unlikely(zap_leader))
		goto repeat;
}

When release_task() returns, every resource, including the process descriptor, has been released, and no trace of the process remains in the system.

Orphan Processes

Conversely, when the parent terminates first, the child is left behind as an orphan process. Unless it is given a new parent, the child could never leave the zombie state when it later exits, because no parent would be there to call wait().

Linux solves this with the call chain do_exit() → exit_notify() → forget_original_parent() → find_new_reaper(). find_new_reaper() searches for a new parent in the following order of priority.

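It looks for another live thread in the exiting process's own thread group.
It looks for the nearest ancestor that has registered itself as a child subreaper with prctl(PR_SET_CHILD_SUBREAPER).⁶
If neither search finds a process, it assigns the init process (PID 1) of the PID namespace as the new parent.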

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
					   struct task_struct *child_reaper)
{
	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);
	if (thread)
		return thread;

	if (father->signal->has_child_subreaper) {
		unsigned int ns_level = task_pid(father)->level;
		/*
		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
		 * We can't check reaper != child_reaper to ensure we do not
		 * cross the namespaces, the exiting parent could be injected
		 * by setns() + fork().
		 * We check pid->level, this is slightly more efficient than
		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
		 */
		for (reaper = father->real_parent;
		     task_pid(reaper)->level == ns_level;
		     reaper = reaper->real_parent) {
			if (reaper == &init_task)
				break;
			if (!reaper->signal->is_child_subreaper)
				continue;
			thread = find_alive_thread(reaper);
			if (thread)
				return thread;
		}
	}

	return child_reaper;
}
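The second rule in find_new_reaper() can be exercised from user space. The sketch below is illustrative only and rests on assumptions: orphan_new_parent() is a hypothetical name, and the calling process is assumed to have no other children, because the function reaps every child it finds. It registers the caller as a child subreaper, lets an intermediate parent exit, and reports the PID the orphan then sees from getppid().

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns the PID that an orphaned grandchild sees
 * as its parent after the intermediate parent has exited, or -1 if the
 * report was not received. */
pid_t orphan_new_parent(void)
{
	struct timespec tick = { 0, 1000000 };   /* 1 ms */
	int fds[2];
	pid_t seen = -1;

	int rc = pipe(fds);
	assert(rc == 0);
	/* From here on, orphaned descendants are reparented to this process. */
	rc = prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
	assert(rc == 0);

	pid_t parent = fork();
	assert(parent >= 0);
	if (parent == 0) {
		pid_t self = getpid();

		if (fork() != 0)
			_exit(0);     /* intermediate parent exits at once */
		/* Orphan: wait until reparenting has happened, then report. */
		while (getppid() == self)
			nanosleep(&tick, NULL);
		pid_t now = getppid();
		if (write(fds[1], &now, sizeof(now)) != (ssize_t)sizeof(now))
			_exit(1);
		_exit(0);
	}

	close(fds[1]);
	if (read(fds[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
		seen = -1;
	close(fds[0]);
	/* Reap the intermediate parent and the adopted orphan. */
	while (wait(NULL) > 0)
		;
	prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);   /* undo the registration */
	return seen;
}
```

With the subreaper registration in place, the returned PID is expected to equal the caller's own getpid(); without it, the orphan would normally end up under the namespace's init process instead.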

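Once find_new_reaper() has chosen the new parent, forget_original_parent() carries out the actual reparenting. It walks all children of the exiting process and points their real_parent at the new parent. If a child has set a pdeath_signal⁷, that signal is sent to it, and if any reparented child is already a zombie, the new parent is notified.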

/*
 * Make init inherit all the child processes
 */
static void forget_original_parent(struct task_struct *father,
				struct list_head *dead)
{
	struct task_struct *p, *t, *reaper;

	if (unlikely(!list_empty(&father->ptraced)))
		exit_ptrace(father, dead);

	/* Can drop and reacquire tasklist_lock */
	reaper = find_child_reaper(father, dead);
	if (list_empty(&father->children))
		return;

	reaper = find_new_reaper(father, reaper);
	list_for_each_entry(p, &father->children, sibling) {
		for_each_thread(p, t) {
			RCU_INIT_POINTER(t->real_parent, reaper);
			BUG_ON((!t->ptrace) != (rcu_access_pointer(t->parent) == father));
			if (likely(!t->ptrace))
				t->parent = t->real_parent;
			if (t->pdeath_signal)
				group_send_sig_info(t->pdeath_signal,
						    SEND_SIG_NOINFO, t,
						    PIDTYPE_TGID);
		}
		/*
		 * If this is a threaded reparent there is no need to
		 * notify anyone anything has happened.
		 */
		if (!same_thread_group(reaper, father))
			reparent_leader(father, p, dead);
	}
	list_splice_tail_init(&father->children, &reaper->children);
}
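The pdeath_signal branch in the listing above can also be observed from user space. This is a sketch under assumptions rather than a reference implementation: pdeath_signal_demo() is a hypothetical name, `sig` must be a signal whose action terminates the process (otherwise the function would block), and the caller registers itself as a child subreaper only so that it can wait for the reparented grandchild.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the signal that terminated the grandchild
 * after its parent exited, or -1 if no such termination was observed. */
int pdeath_signal_demo(int sig)
{
	int ready[2], status = 0, result = -1;
	char byte = 0;
	pid_t pid;

	int rc = pipe(ready);
	assert(rc == 0);
	rc = prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
	assert(rc == 0);

	pid_t parent = fork();
	assert(parent >= 0);
	if (parent == 0) {
		if (fork() == 0) {
			/* Grandchild: request `sig` on parent death, then idle. */
			if (prctl(PR_SET_PDEATHSIG, sig, 0, 0, 0) != 0)
				_exit(2);
			if (write(ready[1], "x", 1) != 1)
				_exit(2);
			for (;;)
				pause();
		}
		/* Intermediate parent: exit only once the request is in place. */
		close(ready[1]);
		if (read(ready[0], &byte, 1) != 1)
			_exit(1);
		_exit(0);
	}

	close(ready[0]);
	close(ready[1]);
	/* Collect both descendants; only the grandchild dies from a signal. */
	while ((pid = wait(&status)) > 0)
		if (pid != parent && WIFSIGNALED(status))
			result = WTERMSIG(status);
	prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);   /* undo the registration */
	return result;
}
```

The pipe exists only to close a race: the intermediate parent must not exit before the grandchild has called prctl(PR_SET_PDEATHSIG, ...), or no signal would be delivered.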

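forget_original_parent() also calls exit_ptrace(). If the exiting process was tracing other tasks with ptrace⁸, the tracing is detached and the tracees are handed back to their real parents; a tracee with the PT_EXITKILL flag set is sent SIGKILL so that it terminates together with the tracer.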

/*
 * Detach all tasks we were using ptrace on. Called with tasklist held
 * for writing.
 */
void exit_ptrace(struct task_struct *tracer, struct list_head *dead)
{
	struct task_struct *p, *n;

	list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) {
		if (unlikely(p->ptrace & PT_EXITKILL))
			send_sig_info(SIGKILL, SEND_SIG_PRIV, p);

		if (__ptrace_detach(tracer, p))
			list_add(&p->ptrace_entry, dead);
	}
}

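exit_notify() coordinates all of this. It first reparents the children with forget_original_parent() and, if the whole thread group is dead, calls kill_orphaned_pgrp() to signal any process group that has become orphaned as a result. It then sets its own state to EXIT_ZOMBIE and notifies the parent. If the parent ignores SIGCHLD, or if the exiting task is an untraced non-leader thread, the task switches straight to EXIT_DEAD and is reaped automatically (autoreap).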

/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
{
	bool autoreap;
	struct task_struct *p, *n;
	LIST_HEAD(dead);

	write_lock_irq(&tasklist_lock);
	forget_original_parent(tsk, &dead);

	if (group_dead)
		kill_orphaned_pgrp(tsk->group_leader, NULL);

	tsk->exit_state = EXIT_ZOMBIE;

	if (unlikely(tsk->ptrace)) {
		int sig = thread_group_leader(tsk) &&
				thread_group_empty(tsk) &&
				!ptrace_reparented(tsk) ?
			tsk->exit_signal : SIGCHLD;
		autoreap = do_notify_parent(tsk, sig);
	} else if (thread_group_leader(tsk)) {
		autoreap = thread_group_empty(tsk) &&
			do_notify_parent(tsk, tsk->exit_signal);
	} else {
		autoreap = true;
		/* untraced sub-thread */
		do_notify_pidfd(tsk);
	}

	if (autoreap) {
		tsk->exit_state = EXIT_DEAD;
		list_add(&tsk->ptrace_entry, &dead);
	}

	/* mt-exec, de_thread() is waiting for group leader */
	if (unlikely(tsk->signal->notify_count < 0))
		wake_up_process(tsk->signal->group_exec_task);
	write_unlock_irq(&tasklist_lock);

	list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
		list_del_init(&p->ptrace_entry);
		release_task(p);
	}
}

Finally, exit_notify() drops tasklist_lock and walks the dead list, calling release_task() on every task in it. Once this is done, everything related to reparenting orphaned processes is complete.

References

Silberschatz, A., Galvin, P. B., & Gagne, G. (2018). Operating System Concepts (10th ed.). John Wiley & Sons.
Love, R. (2010). Linux Kernel Development (3rd ed.). Addison-Wesley Professional.
Bryant, R. E., & O'Hallaron, D. R. (2016). Computer Systems: A Programmer's Perspective (3rd ed.). Pearson.
Kerrisk, M. (2018). The Linux Programming Interface (9th printing). No Starch Press.
Linux kernel source code v7.0: kernel/exit.c, kernel/ptrace.c

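¹ Examples are SIGKILL, or a SIGSEGV for which no handler is installed. SIGKILL and SIGSTOP are the signals that a process can neither catch nor ignore.
² This branch was added in kernel v7.0. A kernel thread has no user address space (its mm is NULL), so cleanup steps such as exit_mm() have little to release for it.
³ When BSD-style process accounting is enabled (CONFIG_BSD_PROCESS_ACCT), the kernel writes the process's CPU time, memory usage, and similar figures to a log file. The acct_collect() and acct_process() calls in the listing belong to this facility.
⁴ A synchronization primitive for controlling access to shared resources, consisting of a non-negative integer and the P (wait) and V (signal) operations.
⁵ A process shown with state Z by the ps command is a zombie. ps aux | grep Z gives a quick list; filtering on the STAT column, for example ps aux | awk '$8 ~ /^Z/', avoids matching unrelated lines.
⁶ A process that calls prctl(PR_SET_CHILD_SUBREAPER, 1) takes over as the new parent, in place of init, for orphans that appear in its descendant tree. systemd is a well-known example. The feature was introduced in Linux v3.4.
⁷ A signal set with prctl(PR_SET_PDEATHSIG, sig) and delivered to the child when its parent terminates. It is mainly used by daemon processes that need to detect their parent's death and clean up or exit on their own.
⁸ A system call that lets one process trace and control the execution of another. Debuggers such as GDB use it to set breakpoints, inspect registers, and read or write memory.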