[Container Internals] Namespaces Explained

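1. Introduction

Namespaces are a feature of the Linux kernel that partitions kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Processes in different namespaces therefore each have their own view of global system resources: changing a resource inside one namespace affects only the processes in that namespace and has no effect on processes in other namespaces.

The feature works by having the same namespace for a set of resources and processes, while different namespaces refer to distinct resources. Resources may exist in multiple namespaces. Examples include process IDs, hostnames, user IDs, file names, and some names associated with network access and interprocess communication.

Namespaces are a fundamental building block of containers on Linux. A Linux system starts out with a single namespace of each type, used by all processes. Processes can create additional namespaces and join different namespaces.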

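2. Types

As of kernel 5.6 there are 8 kinds of namespaces. All of them work the same way: every process is associated with a namespace and can only see or use the resources associated with that namespace and, where applicable, its descendant namespaces. In this way each process gets its own view of the system's resources. Which resource is isolated depends on the kind of namespace created for a given process group.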

overview

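type	flag	function
mnt	CLONE_NEWNS	Controls mount points
pid	CLONE_NEWPID	Gives processes a set of process IDs (PIDs) independent of other namespaces
net	CLONE_NEWNET	Isolates the network stack: network devices, IP addresses, ports, and so on
uts	CLONE_NEWUTS	Lets a single system present different host and domain names to different processes
user	CLONE_NEWUSER	Provides privilege isolation and user identity segregation across sets of processes
ipc	CLONE_NEWIPC	Isolates processes from SysV-style interprocess communication
cgroup	CLONE_NEWCGROUP	Hides the identity of the control group a process belongs to
time	CLONE_NEWTIME	Lets different processes see different system times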

Mount(mnt) namespace

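The mount namespace controls mount points. Processes in different mount namespaces see different filesystem hierarchies, and calls to mount() and umount() inside a mount namespace affect only the filesystem view of that namespace. When a mount namespace is created, the mount points of the current namespace are copied into the new one; mount points created afterwards do not propagate between namespaces (unless shared subtrees are used, which allow mount points to propagate between namespaces).

The clone flag used to create a new namespace of this type is CLONE_NEWNS, short for "NEW NameSpace". The name is not descriptive (it does not say which kind of namespace is created) because mount namespaces were the first kind of namespace, and the designers did not anticipate that others would follow.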

Process ID(pid) namespace

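A PID namespace gives processes a set of process IDs (PIDs) independent of those in other namespaces. PID namespaces are nested: a newly created process has a PID in every namespace from its current namespace up to the initial PID namespace. The initial PID namespace can therefore see all processes, although with different PIDs than other namespaces see them with.

The first process created in a PID namespace is assigned process ID 1 and receives most of the same special treatment as the normal init process. Most notably, orphaned processes within the namespace (processes that keep running after their parent has finished or been terminated) are attached to it, a step known as "adoption". It also means that when this PID 1 process terminates, all processes in its PID namespace and in any descendant namespaces are terminated immediately.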
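
The PID 1 behaviour can be observed from user space. The following is a minimal sketch, not a reference implementation: the function name first_pid_in_new_pidns is made up for illustration, and the code assumes a kernel that allows unprivileged user namespaces (otherwise it simply reports -1). It works in a forked helper so that the caller's own namespaces are left untouched.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reports which PID the first process of a fresh PID
 * namespace sees for itself.
 * Returns 1 if it saw PID 1, 0 if it saw any other PID, and -1 if the
 * namespaces could not be created (for example, missing permission).
 */
int first_pid_in_new_pidns(void)
{
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        /* Creating a user namespace as well gives the helper the
         * capability it needs to create a PID namespace without root. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(2);

        /* unshare() does not move the caller itself; only its next
         * child becomes the first process of the new PID namespace. */
        pid_t inner = fork();
        if (inner < 0)
            _exit(2);
        if (inner == 0)
            _exit(getpid() == 1 ? 0 : 1);

        int st;
        if (waitpid(inner, &st, 0) < 0 || !WIFEXITED(st))
            _exit(2);
        _exit(WEXITSTATUS(st));
    }

    int status;
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:  return 1;   /* inner process saw itself as PID 1 */
    case 1:  return 0;   /* inner process saw some other PID */
    default: return -1;  /* namespaces unavailable */
    }
}
```

Calling first_pid_in_new_pidns() from main() and printing the result is enough to try it. Note that the helper still sees the inner process under an ordinary PID from its own namespace, which is the nesting described above.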

Network(net) namespace

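A network namespace isolates the network stack: network devices, IP addresses, ports, and so on. Each network interface (physical or virtual) is present in exactly one namespace and can be moved between namespaces. Each namespace has a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall, and other network-related resources. A network namespace lets processes have their own (virtual) network devices, and ports in different namespaces do not conflict.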
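
To see that a new network namespace starts with its own private set of interfaces, one can count the interfaces visible right after creating one. This is only a sketch under assumptions: the names count_interfaces and interfaces_in_new_netns are hypothetical, and unprivileged user namespaces must be enabled for it to report anything other than -1.

```c
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Counts the network interfaces visible to the calling process. */
static int count_interfaces(void)
{
    struct if_nameindex *list = if_nameindex();
    if (list == NULL)
        return -1;

    int n = 0;
    /* The array ends with an entry whose if_index is 0. */
    for (struct if_nameindex *it = list; it->if_index != 0; it++)
        n++;

    if_freenameindex(list);
    return n;
}

/*
 * Hypothetical helper: creates a network namespace in a child process and
 * returns how many interfaces are visible inside it (capped at 254),
 * or -1 if the namespace could not be created or inspected.
 */
int interfaces_in_new_netns(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0)
            _exit(255);
        int n = count_interfaces();
        _exit(n < 0 ? 255 : (n > 254 ? 254 : n));
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

On a typical system the new namespace contains only the loopback device, so the count is small and independent of how many interfaces the host has; kernels with tunnel modules loaded may add a few fallback devices.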

UTS namespace

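The UTS (UNIX Time-Sharing) namespace lets a single system present different host and domain names to different processes. When a process creates a new UTS namespace, the hostname and domain name of the new namespace are copied from the corresponding values in the caller's UTS namespace.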

User ID (user) namespace

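User namespaces, officially available since kernel 3.8, provide both privilege isolation and user identity segregation across multiple sets of processes. With administrative assistance it is possible to build a container with seeming administrative rights without actually giving elevated privileges to user processes. Like PID namespaces, user namespaces are nested, and each new user namespace is considered a child of the user namespace that created it.

A user namespace contains a mapping table that converts user IDs from the container's point of view to the system's point of view. This allows, for example, the root user to have user ID 0 in the container while the system actually treats it as user ID 1,400,000 for ownership checks. A similar table is used for group ID mappings and ownership checks.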
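
The mapping can be pictured as a list of ranges, each saying "count IDs starting at inside correspond to IDs starting at outside". The sketch below illustrates the lookup only; it is not the kernel's code, and the names id_extent, ID_UNMAPPED and map_id_to_parent are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One mapping range: IDs [inside, inside + count) in the namespace
 * correspond to IDs [outside, outside + count) outside of it. */
struct id_extent {
    uint32_t inside;
    uint32_t outside;
    uint32_t count;
};

/* Returned when an ID has no mapping (illustrative sentinel value). */
#define ID_UNMAPPED UINT32_MAX

/*
 * Hypothetical helper: translates an ID as seen inside the namespace
 * into the ID the rest of the system uses for ownership checks.
 */
uint32_t map_id_to_parent(const struct id_extent *map, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        /* id lies in this range if its distance from the range start
         * is smaller than the range length. */
        if (id >= map[i].inside && id - map[i].inside < map[i].count)
            return map[i].outside + (id - map[i].inside);
    }
    return ID_UNMAPPED;
}
```

With a single range { 0, 1400000, 65536 }, container ID 0 translates to 1400000 and container ID 1000 to 1401000, while ID 65536 falls outside the range and is reported as unmapped.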

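To facilitate privilege isolation of administrative actions, each namespace of any type is considered owned by a user namespace, namely the user namespace that was active when it was created. A user with administrative privileges in the appropriate user namespace is allowed to perform administrative actions within that other namespace. For example, if a process has administrative permission to change the IP address of a network interface, it may do so as long as its own user namespace is the same as (or an ancestor of) the user namespace that owns the network namespace. Hence the initial user namespace has administrative control over all namespaces of all types in the system.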

Interprocess Communication (ipc) namespace

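The IPC namespace isolates processes from SysV-style interprocess communication. This prevents processes in different IPC namespaces from using, for example, the SHM family of functions to establish a range of shared memory between the two processes. Instead, each process can use the same identifier for a shared memory region, and the result is two distinct regions.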

Control group (cgroup) namespace

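The cgroup namespace hides the identity of the control group a process belongs to. A process in such a namespace that checks which control group any process is part of sees a path that is actually relative to the control group set at creation time, hiding its true control group position and identity. This namespace type has existed since March 2016, in Linux 4.6.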

Time namespace

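The time namespace lets different processes see different system times, in a manner similar to the UTS namespace. It does so through per-namespace offsets for the monotonic and boot-time clocks; the wall-clock time (CLOCK_REALTIME) is not virtualized.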

3. Manipulating namespaces

For every process, the Linux kernel exposes one symbolic link per namespace type under /proc/<pid>/ns/.

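The target of each symbolic link has the form xxx:[inode number], where xxx is the namespace type and the inode number identifies the namespace. The inode number is the same for every process in that namespace, so the inode number that one of these links points to uniquely identifies each namespace.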
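
Reading one of these links takes only a few lines. The sketch below assumes /proc is mounted; the function name read_ns_inode is hypothetical, and the parsing simply relies on the type:[inode] form described above.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reads the link /proc/self/ns/<type> and extracts
 * the inode number from its "type:[inode]" text.
 * Returns 0 on success and -1 on any failure.
 */
int read_ns_inode(const char *type, unsigned long *inode_out)
{
    char path[64];
    char target[64];

    int len = snprintf(path, sizeof path, "/proc/self/ns/%s", type);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;

    ssize_t n = readlink(path, target, sizeof target - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';   /* readlink() does not NUL-terminate */

    /* Skip the "type:" prefix and parse the bracketed number. */
    const char *bracket = strchr(target, '[');
    if (bracket == NULL)
        return -1;

    return sscanf(bracket, "[%lu]", inode_out) == 1 ? 0 : -1;
}
```

Two processes are in the same namespace of a given type exactly when this number matches for both. An alternative to parsing the link text is to call stat() on the link and compare the st_ino field.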

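Namespaces can be manipulated directly through the following three system calls.

clone: creates a new process; its flags specify which new namespaces the new process should be placed in.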

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 unsigned long, tls,
		 int __user *, child_tidptr)
{
	struct kernel_clone_args args = {
		.flags		= (lower_32_bits(clone_flags) & ~CSIGNAL),
		.pidfd		= parent_tidptr,
		.child_tid	= child_tidptr,
		.parent_tid	= parent_tidptr,
		.exit_signal	= (lower_32_bits(clone_flags) & CSIGNAL),
		.stack		= newsp,
		.tls		= tls,
	};

	return kernel_clone(&args);
}

pid_t kernel_clone(struct kernel_clone_args *args)
{
	u64 clone_flags = args->flags;
	struct completion vfork;
	struct pid *pid;
	struct task_struct *p;
	int trace = 0;
	pid_t nr;

	/*
	 * For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
	 * to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
	 * mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
	 * field in struct clone_args and it still doesn't make sense to have
	 * them both point at the same memory location. Performing this check
	 * here has the advantage that we don't need to have a separate helper
	 * to check for legacy clone().
	 */
	if ((args->flags & CLONE_PIDFD) &&
	    (args->flags & CLONE_PARENT_SETTID) &&
	    (args->pidfd == args->parent_tid))
		return -EINVAL;

	/*
	 * Determine whether and which event to report to ptracer.  When
	 * called from kernel_thread or CLONE_UNTRACED is explicitly
	 * requested, no event is reported; otherwise, report if the event
	 * for the type of forking is enabled.
	 */
	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if (args->exit_signal != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		if (likely(!ptrace_event_enabled(current, trace)))
			trace = 0;
	}

	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
	add_latent_entropy();

	if (IS_ERR(p))
		return PTR_ERR(p);

	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	trace_sched_process_fork(current, p);

	pid = get_task_pid(p, PIDTYPE_PID);
	nr = pid_vnr(pid);

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, args->parent_tid);

	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;
		init_completion(&vfork);
		get_task_struct(p);
	}

	wake_up_new_task(p);

	/* forking complete and child started to run, tell ptracer */
	if (unlikely(trace))
		ptrace_event_pid(trace, pid);

	if (clone_flags & CLONE_VFORK) {
		if (!wait_for_vfork_done(p, &vfork))
			ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
	}

	put_pid(pid);
	return nr;
}
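
From user space, the same flags are passed to the clone() library wrapper. The following sketch starts a child in a new UTS namespace, lets it change its hostname, and checks that the parent's hostname is unaffected. The names child_main and clone_into_new_uts are hypothetical, and the code assumes unprivileged user namespaces are permitted; CLONE_NEWUSER is added only so that the example does not need root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Runs inside the new UTS namespace; arg is the hostname to set. */
static int child_main(void *arg)
{
    const char *name = arg;
    char seen[65];

    if (sethostname(name, strlen(name)) != 0)
        return 2;
    if (gethostname(seen, sizeof seen) != 0)
        return 2;
    seen[sizeof seen - 1] = '\0';
    return strcmp(seen, name) == 0 ? 0 : 1;
}

/*
 * Hypothetical helper. Returns 1 if the child renamed its host while the
 * parent's hostname stayed the same, 0 if the parent's hostname changed,
 * and -1 if the namespaces could not be created or the child failed.
 */
int clone_into_new_uts(const char *child_hostname)
{
    char before[65], after[65];

    if (gethostname(before, sizeof before) != 0)
        return -1;
    before[sizeof before - 1] = '\0';

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures, so clone()
     * is given the top of the allocated region. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_NEWUSER | CLONE_NEWUTS | SIGCHLD,
                      (void *)child_hostname);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    if (gethostname(after, sizeof after) != 0)
        return -1;
    after[sizeof after - 1] = '\0';

    return strcmp(before, after) == 0 ? 1 : 0;
}
```

A call such as clone_into_new_uts("ns-demo") should return 1 on systems that allow it. Several CLONE_NEW* flags can be combined in one call to place the child in several new namespaces at once.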

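unshare: lets a process (or thread) disassociate parts of its execution context that are currently shared with other processes (or threads). Its main use is to let a process control its shared execution context without creating a new process. In effect, the caller leaves its original namespaces and joins newly created ones.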

SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
	return ksys_unshare(unshare_flags);
}

/*
 * unshare allows a process to 'unshare' part of the process
 * context which was originally shared using clone.  copy_*
 * functions used by kernel_clone() cannot be used here directly
 * because they modify an inactive task_struct that is being
 * constructed. Here we are modifying the current, active,
 * task_struct.
 */
int ksys_unshare(unsigned long unshare_flags)
{
	struct fs_struct *fs, *new_fs = NULL;
	struct files_struct *new_fd = NULL;
	struct cred *new_cred = NULL;
	struct nsproxy *new_nsproxy = NULL;
	int do_sysvsem = 0;
	int err;

	/*
	 * If unsharing a user namespace must also unshare the thread group
	 * and unshare the filesystem root and working directories.
	 */
	if (unshare_flags & CLONE_NEWUSER)
		unshare_flags |= CLONE_THREAD | CLONE_FS;
	/*
	 * If unsharing vm, must also unshare signal handlers.
	 */
	if (unshare_flags & CLONE_VM)
		unshare_flags |= CLONE_SIGHAND;
	/*
	 * If unsharing a signal handlers, must also unshare the signal queues.
	 */
	if (unshare_flags & CLONE_SIGHAND)
		unshare_flags |= CLONE_THREAD;
	/*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;

	err = check_unshare_flags(unshare_flags);
	if (err)
		goto bad_unshare_out;
	/*
	 * CLONE_NEWIPC must also detach from the undolist: after switching
	 * to a new ipc namespace, the semaphore arrays from the old
	 * namespace are unreachable.
	 */
	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
		do_sysvsem = 1;
	err = unshare_fs(unshare_flags, &new_fs);
	if (err)
		goto bad_unshare_out;
	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
	if (err)
		goto bad_unshare_cleanup_fs;
	err = unshare_userns(unshare_flags, &new_cred);
	if (err)
		goto bad_unshare_cleanup_fd;
	err = unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy,
					 new_cred, new_fs);
	if (err)
		goto bad_unshare_cleanup_cred;

	if (new_cred) {
		err = set_cred_ucounts(new_cred);
		if (err)
			goto bad_unshare_cleanup_cred;
	}

	if (new_fs || new_fd || do_sysvsem || new_cred || new_nsproxy) {
		if (do_sysvsem) {
			/*
			 * CLONE_SYSVSEM is equivalent to sys_exit().
			 */
			exit_sem(current);
		}
		if (unshare_flags & CLONE_NEWIPC) {
			/* Orphan segments in old ns (see sem above). */
			exit_shm(current);
			shm_init_task(current);
		}

		if (new_nsproxy)
			switch_task_namespaces(current, new_nsproxy);

		task_lock(current);

		if (new_fs) {
			fs = current->fs;
			spin_lock(&fs->lock);
			current->fs = new_fs;
			if (--fs->users)
				new_fs = NULL;
			else
				new_fs = fs;
			spin_unlock(&fs->lock);
		}

		if (new_fd)
			swap(current->files, new_fd);

		task_unlock(current);

		if (new_cred) {
			/* Install the new user namespace */
			commit_creds(new_cred);
			new_cred = NULL;
		}
	}

	perf_event_namespaces(current);

bad_unshare_cleanup_cred:
	if (new_cred)
		put_cred(new_cred);
bad_unshare_cleanup_fd:
	if (new_fd)
		put_files_struct(new_fd);

bad_unshare_cleanup_fs:
	if (new_fs)
		free_fs_struct(new_fs);

bad_unshare_out:
	return err;
}
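
The effect of unshare() on the calling process can be checked by comparing the inode number of its /proc/self/ns/uts link before and after the call. This is a sketch under assumptions: uts_ns_inode and unshare_changes_own_uts_ns are made-up names, /proc must be mounted, and unprivileged user namespaces must be enabled. The work happens in a forked child so that the caller keeps its namespaces.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stores the inode number identifying the caller's UTS namespace. */
static int uts_ns_inode(ino_t *out)
{
    struct stat st;
    if (stat("/proc/self/ns/uts", &st) != 0)
        return -1;
    *out = st.st_ino;
    return 0;
}

/*
 * Hypothetical helper. Returns 1 if unshare() moved the (child) process
 * into a different UTS namespace, 0 if the namespace stayed the same,
 * and -1 if the check could not be carried out.
 */
int unshare_changes_own_uts_ns(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        ino_t before, after;

        if (uts_ns_inode(&before) != 0)
            _exit(2);
        /* No new process is created here: the caller itself moves. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0)
            _exit(2);
        if (uts_ns_inode(&after) != 0)
            _exit(2);
        _exit(before != after ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

One exception to "the caller itself moves" is CLONE_NEWPID: as the pid_ns_for_children field in nsproxy suggests, a new PID namespace applies only to children created afterwards.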

setns: enters the namespace referred to by a file descriptor.

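 /* Abridged from an older kernel version; the error checks are kept so that err is always set. */
 SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 {
     struct task_struct *tsk = current;
     struct nsproxy *new_nsproxy;
     struct file *file;
     struct ns_common *ns;
     int err;

     file = proc_ns_fget(fd);
     if (IS_ERR(file))
         return PTR_ERR(file);

     /* Reject a descriptor whose namespace type does not match nstype. */
     err = -EINVAL;
     ns = get_proc_ns(file_inode(file));
     if (nstype && (ns->ops->type != nstype))
         goto out;

     new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
     if (IS_ERR(new_nsproxy)) {
         err = PTR_ERR(new_nsproxy);
         goto out;
     }

     /* Let the namespace type install itself into the new nsproxy. */
     err = ns->ops->install(new_nsproxy, ns);
     if (err) {
         free_nsproxy(new_nsproxy);
         goto out;
     }

     switch_task_namespaces(tsk, new_nsproxy);
 out:
     fput(file);
     return err;
 }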
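
A user-space round trip with setns() can be sketched as follows: a child creates a UTS namespace, keeps a file descriptor to it, moves on to a second UTS namespace, and then uses the descriptor to return to the first. The names current_uts_inode, rejoin_in_child and setns_rejoins_namespace are hypothetical, and the example assumes /proc is mounted and unprivileged user namespaces are enabled.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define UTS_LINK "/proc/self/ns/uts"

/* Stores the inode number identifying the caller's UTS namespace. */
static int current_uts_inode(ino_t *out)
{
    struct stat st;
    if (stat(UTS_LINK, &st) != 0)
        return -1;
    *out = st.st_ino;
    return 0;
}

/* Runs in the child; the return value is used as its exit code. */
static int rejoin_in_child(void)
{
    ino_t first, second, third;

    /* Step 1: create a first UTS namespace owned by a new user namespace. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0 ||
        current_uts_inode(&first) != 0)
        return 2;

    /* The open descriptor keeps the first namespace alive after we leave. */
    int fd = open(UTS_LINK, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return 2;

    /* Step 2: move on to a second UTS namespace. */
    if (unshare(CLONE_NEWUTS) != 0 || current_uts_inode(&second) != 0) {
        close(fd);
        return 2;
    }

    /* Step 3: use the descriptor to re-enter the first namespace. */
    if (setns(fd, CLONE_NEWUTS) != 0 || current_uts_inode(&third) != 0) {
        close(fd);
        return 2;
    }
    close(fd);

    return (second != first && third == first) ? 0 : 1;
}

/*
 * Hypothetical helper. Returns 1 if the child got back into its first
 * namespace through setns(), 0 if it did not, -1 if the test could not run.
 */
int setns_rejoins_namespace(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(rejoin_in_child());

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

Passing CLONE_NEWUTS as the second argument asks the kernel to verify the descriptor's namespace type, which corresponds to the nstype check in the kernel code; passing 0 accepts any type.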

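A namespace is destroyed once it is no longer referenced; what happens to the resources it contained depends on the namespace type. A namespace can be referenced in three ways:

By a process belonging to the namespace
By an open file descriptor for the namespace's file, /proc/<pid>/ns/<ns-kind>
By a bind mount of the namespace's file, /proc/<pid>/ns/<ns-kind>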

4. Source definitions

/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 *
 * The pid namespace is an exception -- it's accessed using
 * task_active_pid_ns.  The pid namespace here is the
 * namespace that children will use.
 *
 * 'count' is the number of tasks holding a reference.
 * The count for each namespace, then, will be the number
 * of nsproxies pointing to it, not the number of tasks.
 *
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net 	     *net_ns;
	struct time_namespace *time_ns;
	struct time_namespace *time_ns_for_children;
	struct cgroup_namespace *cgroup_ns;
};

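/* nsproxy has no user_namespace member because the user namespace is special: it is used
 * for credential checks, whereas an nsproxy is shared by all tasks that share the same
 * set of namespaces. The user namespace is reached through the cred member of
 * struct task_struct, which holds a struct user_namespace *user_ns.
 */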
 struct user_namespace {
     struct uid_gid_map	uid_map;
     struct uid_gid_map	gid_map;
     struct uid_gid_map	projid_map;
     atomic_t		count;
     struct user_namespace	*parent;
     int			level;
     kuid_t			owner;
     kgid_t			group;
     struct ns_common	ns;
     unsigned long		flags;

     /* Register of per-UID persistent keyrings for this namespace */
     #ifdef CONFIG_PERSISTENT_KEYRINGS
     struct key		*persistent_keyring_register;
     struct rw_semaphore	persistent_keyring_register_sem;
     #endif
 };

struct mnt_namespace {
	atomic_t		count;
	struct ns_common	ns;
	struct mount *	root;
	struct list_head	list;
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	u64			seq;	/* Sequence number to prevent loops */
	wait_queue_head_t poll;
	u64 event;
	unsigned int		mounts; /* # of mounts in the namespace */
	unsigned int		pending_mounts;
};

struct uts_namespace {
	struct kref kref;
	struct new_utsname name;
	struct user_namespace *user_ns;
	struct ucounts *ucounts;
	struct ns_common ns;
};

struct ipc_namespace {
	refcount_t	count;
	struct ipc_ids	ids[3];

	int		sem_ctls[4];
	int		used_sems;

	unsigned int	msg_ctlmax;
	unsigned int	msg_ctlmnb;
	unsigned int	msg_ctlmni;
	atomic_t	msg_bytes;
	atomic_t	msg_hdrs;

	size_t		shm_ctlmax;
	size_t		shm_ctlall;
	unsigned long	shm_tot;
	int		shm_ctlmni;
	/*
	 * Defines whether IPC_RMID is forced for _all_ shm segments regardless
	 * of shmctl()
	 */
	int		shm_rmid_forced;

	struct notifier_block ipcns_nb;

	/* The kern_mount of the mqueuefs sb.  We take a ref on it */
	struct vfsmount	*mq_mnt;

	/* # queues in this ns, protected by mq_lock */
	unsigned int    mq_queues_count;

	/* next fields are set through sysctl */
	unsigned int    mq_queues_max;   /* initialized to DFLT_QUEUESMAX */
	unsigned int    mq_msg_max;      /* initialized to DFLT_MSGMAX */
	unsigned int    mq_msgsize_max;  /* initialized to DFLT_MSGSIZEMAX */
	unsigned int    mq_msg_default;
	unsigned int    mq_msgsize_default;

	/* user_ns which owns the ipc ns */
	struct user_namespace *user_ns;
	struct ucounts *ucounts;

	struct ns_common ns;
};

struct pid_namespace {
	struct kref kref;
	struct idr idr;
	struct rcu_head rcu;
	unsigned int pid_allocated;
	struct task_struct *child_reaper;
	struct kmem_cache *pid_cachep;
	unsigned int level;
	struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
	struct vfsmount *proc_mnt;
	struct dentry *proc_self;
	struct dentry *proc_thread_self;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
	struct fs_pin *bacct;
#endif
	struct user_namespace *user_ns;
	struct ucounts *ucounts;
	struct work_struct proc_work;
	kgid_t pid_gid;
	int hide_pid;
	int reboot;	/* group exit code if this pidns was rebooted */
	struct ns_common ns;
};

struct cgroup_namespace {
	refcount_t		count;
	struct ns_common	ns;
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	struct css_set          *root_cset;
};

struct net {
	refcount_t		passive;	/* To decided when the network
						 * namespace should be freed.
						 */
	refcount_t		count;		/* To decided when the network
						 *  namespace should be shut down.
						 */
	spinlock_t		rules_mod_lock;

	atomic64_t		cookie_gen;

	struct list_head	list;		/* list of network namespaces */
	struct list_head	exit_list;	/* To linked to call pernet exit
						 * methods on dead net (
						 * pernet_ops_rwsem read locked),
						 * or to unregister pernet ops
						 * (pernet_ops_rwsem write locked).
						 */
	struct llist_node	cleanup_list;	/* namespaces on death row */

	struct user_namespace   *user_ns;	/* Owning user namespace */
	struct ucounts		*ucounts;
	spinlock_t		nsid_lock;
	struct idr		netns_ids;

	struct ns_common	ns;

	struct proc_dir_entry 	*proc_net;
	struct proc_dir_entry 	*proc_net_stat;

#ifdef CONFIG_SYSCTL
	struct ctl_table_set	sysctls;
#endif

	struct sock 		*rtnl;			/* rtnetlink socket */
	struct sock		*genl_sock;

	struct uevent_sock	*uevent_sock;		/* uevent socket */

	struct list_head 	dev_base_head;
	struct hlist_head 	*dev_name_head;
	struct hlist_head	*dev_index_head;
	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
	int			ifindex;
	unsigned int		dev_unreg_count;

	/* core fib_rules */
	struct list_head	rules_ops;

	struct list_head	fib_notifier_ops;  /* Populated by
						    * register_pernet_subsys()
						    */
	struct net_device       *loopback_dev;          /* The loopback */
	struct netns_core	core;
	struct netns_mib	mib;
	struct netns_packet	packet;
	struct netns_unix	unx;
	struct netns_ipv4	ipv4;
#if IS_ENABLED(CONFIG_IPV6)
	struct netns_ipv6	ipv6;
#endif
#if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
	struct netns_ieee802154_lowpan	ieee802154_lowpan;
#endif
#if defined(CONFIG_IP_SCTP) || defined(CONFIG_IP_SCTP_MODULE)
	struct netns_sctp	sctp;
#endif
#if defined(CONFIG_IP_DCCP) || defined(CONFIG_IP_DCCP_MODULE)
	struct netns_dccp	dccp;
#endif
#ifdef CONFIG_NETFILTER
	struct netns_nf		nf;
	struct netns_xt		xt;
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	struct netns_ct		ct;
#endif
#if defined(CONFIG_NF_TABLES) || defined(CONFIG_NF_TABLES_MODULE)
	struct netns_nftables	nft;
#endif
#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
	struct netns_nf_frag	nf_frag;
	struct ctl_table_header *nf_frag_frags_hdr;
#endif
	struct sock		*nfnl;
	struct sock		*nfnl_stash;
#if IS_ENABLED(CONFIG_NETFILTER_NETLINK_ACCT)
	struct list_head        nfnl_acct_list;
#endif
#if IS_ENABLED(CONFIG_NF_CT_NETLINK_TIMEOUT)
	struct list_head	nfct_timeout_list;
#endif
#endif
#ifdef CONFIG_WEXT_CORE
	struct sk_buff_head	wext_nlevents;
#endif
	struct net_generic __rcu	*gen;

	struct bpf_prog __rcu	*flow_dissector_prog;

	/* Note : following structs are cache line aligned */
#ifdef CONFIG_XFRM
	struct netns_xfrm	xfrm;
#endif
#if IS_ENABLED(CONFIG_IP_VS)
	struct netns_ipvs	*ipvs;
#endif
#if IS_ENABLED(CONFIG_MPLS)
	struct netns_mpls	mpls;
#endif
#if IS_ENABLED(CONFIG_CAN)
	struct netns_can	can;
#endif
	struct sock		*diag_nlsk;
	atomic_t		fnhe_genid;
};

struct time_namespace {
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	struct ns_common	ns;
	struct timens_offsets	offsets;
	struct page		*vvar_page;
	/* If set prevents changing offsets after any task joined namespace. */
	bool			frozen_offsets;
};