Preface

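I recently ran into a strange problem: whenever the /tmp directory fills up, the file /tmp/resolv.conf.d/resolv.conf.auto is invariably emptied within a short while, dnsmasq can no longer find its upstream DNS servers, and the network goes down. Judging by the file's modification time, the file is never touched as long as free space stays sufficient; once free space runs out, it is replaced by an empty file.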

Who touched resolv.conf

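First, the purpose of this file: dnsmasq reads it to determine the upstream servers to which it forwards DNS queries. These servers can be configured in LuCI.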

Since the setting is written through LuCI, the prime suspect is netifd, the daemon responsible for most network-related configuration in OpenWrt.

Searching the source confirms that __interface_write_dns_entries does the actual writing of the resolv file. The code of its only caller, interface_write_resolv_conf, shown below, reveals the culprit.

void
interface_write_resolv_conf(const char *jail)
{
    size_t plen = (jail ? strlen(jail) + 1 : 0 ) +
        (strlen(resolv_conf) >= strlen(DEFAULT_RESOLV_CONF) ?
        strlen(resolv_conf) : strlen(DEFAULT_RESOLV_CONF) ) + 1;
    char *path = alloca(plen);
    char *dpath = alloca(plen);
    char *tmppath = alloca(plen + 4);
    FILE *f;
    uint32_t crcold, crcnew;

    if (jail) {
        sprintf(path, "/tmp/resolv.conf-%s.d/resolv.conf.auto", jail);
        strcpy(dpath, path);
        dpath = dirname(dpath);
        mkdir(dpath, 0755);
    } else {
        strcpy(path, resolv_conf);
    }

    sprintf(tmppath, "%s.tmp", path);
    unlink(tmppath);
    f = fopen(tmppath, "w+");
    if (!f) {
        D(INTERFACE, "Failed to open %s for writing\n", path);
        return;
    }

    __interface_write_dns_entries(f, jail);

    fflush(f);
    rewind(f);
    crcnew = crc32_file(f);
    fclose(f);

    crcold = crcnew + 1;
    f = fopen(path, "r");
    if (f) {
        crcold = crc32_file(f);
        fclose(f);
    }

    if (crcold == crcnew) {
        unlink(tmppath);
    } else if (rename(tmppath, path) < 0) {
        D(INTERFACE, "Failed to replace %s\n", path);
        unlink(tmppath);
    }
}

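The function works as follows: each time it is called, it creates a temporary file at tmppath, then compares the CRC32 of the temporary file with that of the existing file to decide whether the existing file should be replaced.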

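This explains why the resolv file's modification time does not change while its contents stay the same. Once free space runs out, however, the new temporary file is created but nothing can be written to it, and the code shown does not check the return value of fflush, so the result is an empty file. The CRC32 comparison is then bound to differ, and the old file is replaced by the new, empty one.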
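The compare-then-rename behavior can be reproduced outside netifd. The following is a minimal shell sketch under stated assumptions, not netifd code: replace_if_changed is a hypothetical helper, and the POSIX cksum utility stands in for crc32_file.

```shell
#!/bin/sh
# Sketch of the compare-then-rename pattern seen in interface_write_resolv_conf.
# replace_if_changed is a hypothetical helper; cksum stands in for crc32_file.
replace_if_changed() {
    path=$1
    content=$2
    tmp="$path.tmp"

    rm -f "$tmp"
    # On a full filesystem this write can fail and leave an empty temp file.
    printf '%s' "$content" > "$tmp"

    new_sum=$(cksum < "$tmp")
    old_sum=
    [ -f "$path" ] && old_sum=$(cksum < "$path")

    if [ "$new_sum" = "$old_sum" ]; then
        rm -f "$tmp"        # identical: keep the old file and its mtime
        echo unchanged
    else
        mv "$tmp" "$path"   # different: the temp file wins, even if it is empty
        echo replaced
    fi
}
```

Calling the helper twice with the same content leaves the target untouched the second time; calling it with empty content, which mimics a write that failed for lack of space, replaces a valid file with an empty one.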

Who triggers the resolv refresh

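At this point most of the picture is clear, but one question remains:
under what circumstances does the system DNS get updated, and why is the resolv file rewritten roughly every 10 minutes? Once space is available again, the file returns to normal within a few minutes.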
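To observe the refresh cadence directly, the file's modification time and size can be polled. This is only a sketch: mtime_of and watch_mtime are hypothetical helper names, and the script assumes a `date` that accepts `-r FILE` (GNU and BusyBox both do).

```shell
#!/bin/sh
# mtime_of FILE: print FILE's modification time in seconds since the epoch.
# Relies on `date -r FILE`, which GNU date and BusyBox date both accept.
mtime_of() {
    date -r "$1" +%s
}

# watch_mtime FILE [INTERVAL]: print a line whenever FILE's mtime or size
# changes. Hypothetical helper for observation only; stop it with Ctrl-C.
watch_mtime() {
    file=$1
    interval=${2:-10}
    last=
    while :; do
        now="$(mtime_of "$file") $(wc -c < "$file")"
        if [ "$now" != "$last" ]; then
            echo "$(date) mtime/size now: $now"
            last=$now
        fi
        sleep "$interval"
    done
}
```

For example, `watch_mtime /tmp/resolv.conf.d/resolv.conf.auto 30` would print a line each time the file is replaced, which makes it easy to line the rewrites up against other logs.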

A reverse lookup shows that interface_write_resolv_conf has the following callers:

interface_proto_event_cb
interface_change_config
interface_ip_update_complete

interface_proto_event_cb can be ruled out: the code only refreshes the resolv file on IFPEV_UP, IFPEV_DOWN and IFPEV_LINK_LOST, and none of these events actually occurred. The description in the design document corroborates this.

state:
  IFS_SETUP:
    The interface is currently being configured by the protocol handler
  IFS_UP:
    The interface is fully configured
  IFS_TEARDOWN:
    The interface is being deconfigured
  IFS_DOWN:
    The interface is down

The remaining two, interface_change_config and interface_ip_update_complete, both look suspicious.

The only caller of interface_change_config is interface_update, which is registered as a callback in interface_init_list.

static void
interface_update(struct vlist_tree *tree, struct vlist_node *node_new,
         struct vlist_node *node_old)
{
    struct interface *if_old = container_of(node_old, struct interface, node);
    struct interface *if_new = container_of(node_new, struct interface, node);

    if (node_old && node_new) {
        D(INTERFACE, "Update interface '%s'\n", if_new->name);
        interface_change_config(if_old, if_new);
    } else if (node_old) {
        D(INTERFACE, "Remove interface '%s'\n", if_old->name);
        set_config_state(if_old, IFC_REMOVE);
    } else if (node_new) {
        D(INTERFACE, "Create interface '%s'\n", if_new->name);
        interface_event(if_new, IFEV_CREATE);
        proto_init_interface(if_new, if_new->config);
        interface_claim_device(if_new);
        netifd_ubus_add_interface(if_new);
    }
}

static void __init
interface_init_list(void)
{
    vlist_init(&interfaces, avl_strcmp, interface_update);
    interfaces.keep_old = true;
    interfaces.no_delete = true;
}

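In other words, it is called whenever the interfaces list changes (create/remove/update). The "update" case is intriguing, but the documentation does not say what counts as an update, so for now I remain in the dark.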

Turning to the other function, interface_ip_update_complete, its callers are:

config_init_ip
interface_update_complete

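The only caller of config_init_ip is config_init_all, which initializes all interfaces and is called only from main and netifd_reload. After checking the logs and the PID, I confirmed that netifd had not restarted; there was no sign of a reload either, and the interfaces did not appear to have been reset, so this call path can essentially be ruled out.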

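That leaves interface_update_complete, which looks very suspicious; its only caller, proto_shell_update_link, lives in proto-shell.c, a source file that immediately draws attention.
Following the call chain up through proto_shell_notify, proto_shell_attach and proto_shell_add_handler leads to the top-level initialization function proto_shell_init.
That function loads every protocol supported on the system from the /lib/netifd/proto/ directory: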

# ls -l /lib/netifd/proto/
-rwxr-xr-x    1 root     root          6279 Jun 10 00:27 bonding.sh
-rwxrwxr-x    1 root     root          2868 Jun 10 00:27 dhcp.sh
-rwxr-xr-x    1 root     root          4902 Jun 10 00:27 dhcpv6.sh
-rwxr-xr-x    1 root     root          7833 Jun 10 00:27 ppp.sh

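Several familiar protocols show up. To determine which protocol is triggering the refresh, we need to check whether a daemon for each protocol is running on the system and whether that daemon calls any external scripts.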

Which protocol is misbehaving

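As it turns out, the system runs both dhcp and dhcpv6, and the standard ppp as well. That complicates things; all I could do was make an educated guess at the likely culprit.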

There are two main ways for processes and services to communicate in OpenWrt:

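Communication through external shell scripts
Communication through ubus

The ubus approach generally requires linking the relevant library into the program's source code and sending messages from there, whereas shell scripts are more flexible: by sourcing the libraries OpenWrt provides, they can communicate easily (although underneath this may still go through ubus/luci or the like).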

For various reasons (laziness), I started by examining the external scripts that are referenced:

ppp references three scripts under /lib/netifd/: ppp-up, ppp6-up and ppp-down. Our ppp connection had not redialed, however, so this one is ruled out first.

For dhcp and dhcpv6 I use OpenWrt's udhcpc and odhcp6c, which call dhcp.script and dhcpv6.script respectively, both also under /lib/netifd/.

Inspecting the scripts shows that both source /lib/netifd/netifd-proto.sh and call the proto_init_update and proto_send_update functions it provides.

Both scripts also contain code that handles DNS servers, so the cause is almost certainly one of the two.
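The manual inspection can be shortened with a small search. This is a sketch only: list_update_callers is a hypothetical helper that prints the files in a directory that mention proto_send_update.

```shell
#!/bin/sh
# list_update_callers DIR: print the names of regular files directly under DIR
# that contain the string proto_send_update.
# Hypothetical helper; it uses only standard grep and shell globbing.
list_update_callers() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if grep -q 'proto_send_update' "$f"; then
            echo "${f##*/}"
        fi
    done
}
```

Running `list_update_callers /lib/netifd` on the router should list the scripts worth reading first; a match only shows that the function name appears in the file, not that the call is reached at runtime.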

A closer look at netifd-proto.sh confirms this; proto_init_update contains the following code:

proto_init_update() {
......
    json_add_int action 0
......
}

This corresponds exactly to the code in proto_shell_notify in proto-shell.c that handles NOTIFY_ACTION, where the value passed in is 0:

enum {
    NOTIFY_ACTION,
    ......
};

static const struct blobmsg_policy notify_attr[__NOTIFY_LAST] = {
    [NOTIFY_ACTION] = { .name = "action", .type = BLOBMSG_TYPE_INT32 },
    ......
};

static int
proto_shell_notify(struct interface_proto_state *proto, struct blob_attr *attr)
{
    struct proto_shell_state *state;
    struct blob_attr *tb[__NOTIFY_LAST];

    state = container_of(proto, struct proto_shell_state, proto);

    blobmsg_parse(notify_attr, __NOTIFY_LAST, tb, blob_data(attr), blob_len(attr));
    if (!tb[NOTIFY_ACTION])
        return UBUS_STATUS_INVALID_ARGUMENT;

    switch(blobmsg_get_u32(tb[NOTIFY_ACTION])) {
    case 0:
        return proto_shell_update_link(state, attr, tb);
    ......
}

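Conveniently, both scripts support hot-plug user scripts, which helps narrow down the culprit:
For udhcpc, custom scripts can be placed in the /etc/udhcpc.user file or in the /etc/udhcpc.user.d directory;
for odhcp6c, they can likewise be placed in the /etc/odhcp6c.user file or in the /etc/odhcp6c.user.d directory (supported since OpenWrt 21.02; this can be verified in the script files mentioned above).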

The custom script is very simple; taking odhcp6c as an example:

#!/bin/sh
date >> /tmp/odhcp6c.user.env
echo "$*" >> /tmp/odhcp6c.user.env
export >> /tmp/odhcp6c.user.env
echo ==================== >> /tmp/odhcp6c.user.env
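Because the hook above appends to a file under /tmp on every event, a long-running trace can itself help fill /tmp. A hedged variant that caps the log might look like the following; log_event, LOG and MAX_LINES are hypothetical names, and the `export` dump is left out to keep each record short.

```shell
#!/bin/sh
# Hypothetical variant of the logging hook that keeps the log bounded.
LOG=${LOG:-/tmp/odhcp6c.user.env}
MAX_LINES=${MAX_LINES:-200}

# log_event ARGS...: append one record (date, arguments, separator) to $LOG,
# then keep only the newest $MAX_LINES lines.
log_event() {
    {
        date
        echo "$*"
        echo ====================
    } >> "$LOG"
    # Replace the log only if the trimmed copy was written successfully.
    tail -n "$MAX_LINES" "$LOG" > "$LOG.new" && mv "$LOG.new" "$LOG"
}
```

In a real hook the last line would be `log_event "$@"`, so that each invocation records the arguments passed by odhcp6c.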

There is only one truth: so it was you

With the script recording the relevant parameters, I simulated a full /tmp by hand, and the log file indeed gave the game away.

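Every time dnsmasq complains that it cannot find any DNS servers in the file, the resolv file has just been updated, and there is always a matching odhcp6c invocation in the log: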

# logread | grep retry
Fri Aug 16 17:50:06 2024 daemon.warn dnsmasq[32099]: no servers found in /tmp/resolv.conf.d/resolv.conf.auto, will retry    

# date -r /tmp/resolv.conf.d/resolv.conf.auto
Fri Aug 16 18:26:13 CST 2024

# cat /tmp/odhcp6c.user.env
Fri Aug 16 17:50:06 CST 2024
pppoe-wan ra-updated

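After all this digging, I finally confirmed what makes the system update its DNS: roughly every 10 minutes the ISP's server sends out an RA which, besides PD/GW and related information, also carries DNS information, and this causes netifd to regenerate the resolv file.
And because the network configuration happened to ignore IPv6-supplied DNS by default, I had not noticed that the IPv6 RA was behind the problem.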
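To see which odhcp6c events fire and how often, the log written by the hook can be summarized. The sketch below assumes the record layout that hook produces (a date line, the argument line, any exported variables, then a separator line of twenty `=` characters); count_events is a hypothetical helper.

```shell
#!/bin/sh
# count_events LOGFILE: count how many records carry each argument line.
# Assumes each record is: date line, argument line, optional variable dump,
# then a separator line consisting of exactly twenty '=' characters.
count_events() {
    awk '
        { n++ }                          # position of this line within the record
        n == 2 { count[$0]++ }           # the second line holds the hook arguments
        /^====================$/ { n = 0 }   # separator: the next line starts a record
        END { for (e in count) print count[e], e }
    ' "$1" | sort -rn
}
```

Running `count_events /tmp/odhcp6c.user.env` prints one line per distinct argument string, most frequent first, which makes a recurring event such as ra-updated stand out.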

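After all that effort the problem was finally located, yet on reflection the solution has little to do with what I found: enlarge /tmp and periodically clear out some of the debug logs in it so that it never fills up. This long detour turned up no bug; the behavior is arguably a feature, so no new fix came out of it either.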
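The periodic cleanup could be automated along the following lines. This is a sketch under assumptions: tmp_usage_percent and trim_if_full are hypothetical helpers, and the threshold and the list of log files are left to the caller.

```shell
#!/bin/sh
# tmp_usage_percent [DIR]: print the usage of DIR's filesystem as an integer.
# df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%".
tmp_usage_percent() {
    df -P "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# trim_if_full DIR LIMIT FILE...: truncate the listed files when the usage of
# DIR's filesystem is at or above LIMIT percent. Hypothetical helper.
trim_if_full() {
    dir=$1
    limit=$2
    shift 2
    [ "$(tmp_usage_percent "$dir")" -ge "$limit" ] || return 0
    for f in "$@"; do
        [ -f "$f" ] && : > "$f"   # truncate in place; the file itself remains
    done
    return 0
}
```

A call such as `trim_if_full /tmp 90 /tmp/odhcp6c.user.env`, run periodically (for example from cron), would empty the debug log before /tmp runs out of space; the value 90 is only an illustrative threshold.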

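That is how tinkering goes: the fun is usually in the process (~~it just takes far too much time~~), and the solution may be a mere by-product. Quite often the tinkering has to be abandoned because the cause cannot be found and too much time has been spent, and a new approach only appears later in a sudden flash of inspiration.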

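Anyway, I am very happy with this successful round of tinkering, and I hope future ones go just as smoothly. (I have a feeling I tinkered with this before without success; ~~or maybe I just forgot, since with age I rely more and more on external records~~)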

References

[1] https://github.com/openwrt/netifd/tree/openwrt-21.02
