Static IP assigned Windows machines started failing DNS resolution against our internal DNS / AD servers
A problem recently appeared for me in April / June 2022, and I don't believe we changed anything on our end.
TLDR: Windows servers assigned static IPs can no longer do DNS resolution or Active Directory operations that leverage Domain Controllers running on our internal network. This worked fine for several years, and just recently stopped working without any known configuration change.
We have a Comcast Business Gateway with 5 static IPs (in the 96.81.X.X range) for our public facing Windows servers. All of our other computers/devices are on the internal 10.1.10.X network, managed by the Gateway's DHCP service, with a few reserved IPs for some internal servers, such as 2 Windows Active Directory Domain Controllers that also run DNS (referred to as our DCs). The Comcast Gateway is configured to provide DHCP clients with the DNS IPs of our internal DCs (in the 10.1.10.X range), and our DCs' DNS services have the Comcast DNS servers configured as forwarders. Those same internal DNS IPs are manually assigned in the static IP config of our public facing servers.
The machines on our internal network are part of the Active Directory Domain (managed by our DCs), and they talk to the internal DCs for authentication/authorization and to resolve the names of other machines on our network (or outside our network) via the DNS services running on those DCs. From the internal network standpoint, everything is continuing to work perfectly.
The problem that recently appeared is with our public facing machines that are assigned static IPs. These are accessible on the public internet as expected. On those machines, if I tracert/traceroute to a 10.1.10.x IP, such as one of the DCs, there are exactly 2 hops - to the static IP gateway (+1 of the last octet of our last assigned static IP), and then directly to the target 10.1.10.x IP. Pings to the internal IPs also go through fine. This all seems to be the result of the magic of the Comcast Business Gateway handling routing from the static IPs into the 10.1.10.x internal network. Everything was working great for the last 2 years that we have had this Comcast service. The machines with static IPs were joined into our Active Directory Domain 2 years ago, and apparently had no trouble discovering the necessary DCs on the internal network. The Forward Lookup Zone of our DCs' DNS service has an entry for our domain which has FQDNs for all the machines along with their expected internal or external IPs (without any manual intervention).
However, as of the last few weeks, the servers assigned to our static IPs can still tracert and ping the internal DCs' IPs, but all Active Directory domain resolution appears to have stopped working. My understanding is that Active Directory relies heavily on DNS resolution, which is also not working. The machines still believe they are members of the domain, but an attempt to look up the domain in the Security tab of a file system artifact in Windows Explorer does not present the domain as an option, nltest commands fail, and so do attempts to run "nslookup <FQDN of our DC>", despite tracert and ping working just fine for the <internal IP of our DC>.
If I temporarily switch the public facing server to get its IP by DHCP, it picks up an internal IP and immediately everything starts working, and all the tests succeed. But when I then reassign the static IP, the subnet mask of 255.255.255.248 (appropriate for a 5 static IP configuration), and the appropriate 96.81.X.X gateway IP, everything immediately starts failing again regarding DNS and Active Directory.
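To make the addressing concrete, here is a rough sketch using Python's standard `ipaddress` module. The 198.51.100.0/29 block is a hypothetical stand-in from the documentation range, not our real 96.81.X.X block, and the layout (5 static IPs followed by the gateway as the last usable address) is assumed from the description above. It shows that the 255.255.255.248 mask leaves the 10.1.10.X DNS servers off-link, so every DNS query from a statically addressed server has to pass through the Gateway.

```python
import ipaddress

# Hypothetical /29 block (documentation range); substitute the real static block.
static_block = ipaddress.ip_network("198.51.100.0/29")

usable = list(static_block.hosts())   # 6 usable addresses in a /29
static_ips = usable[:-1]              # assumed: first 5 are the static IPs
gateway = usable[-1]                  # assumed: gateway is last static IP + 1

# An internal address in the 10.1.10.X range, standing in for one of the DCs.
dc = ipaddress.ip_address("10.1.10.116")

print("netmask:", static_block.netmask)
print("static IPs:", [str(ip) for ip in static_ips])
print("gateway:", gateway)
print("DC on the same subnet as the static IPs?", dc in static_block)
```

Because the DC is not inside the static block, the server sends DNS traffic to its default gateway rather than directly to the DC, which fits the 2 hop tracert described earlier.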
This makes me think that either:
a. Something is wrong with my IP configuration on the public Windows server machine, but the only things I'm changing are the IP, subnet mask, and gateway.
b. Something is wrong in the networking layer of the Gateway, which I don't have any visibility into (as far as I know), and it had worked fine before.
c. Something has changed in our DCs configuration (Active Directory or DNS), which we haven't changed at all.
The specific error for the "nslookup <FQDN of our DC>" command is as follows. The Address IP it reports is the expected DNS server (one of our DCs).
*** UnKnown can't find nevoad1.nevo-ad.office.nevo.com: Non-existent domain
The "Non-existent domain" message seems to be the best tendril to look into, but leads me to articles about changing the Forward or Reverse Lookup Zones in our DNS server. But that doesn't seem to be the issue to me if just switching the machine's IP to be a DHCP assigned one makes all these commands work.
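One way to look past nslookup's summary is to send a raw DNS query to the DC's IP and inspect the reply header. The following is a minimal sketch using only the Python standard library; the function names (`build_query`, `parse_header`, `query`) are made up for illustration, and it only decodes the header rather than the full answer. It reports the response code (3 is NXDOMAIN, which nslookup prints as "Non-existent domain") and the authoritative-answer flag, which a server hosting the zone normally sets on its replies.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, txid=0x1234):
    """Build a one-question DNS query (class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(response):
    """Decode the 12-byte DNS header of a reply."""
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F
    return {
        "txid": txid,
        "rcode": RCODES.get(rcode, str(rcode)),
        "authoritative": bool(flags & 0x0400),  # AA bit
        "answers": ancount,
    }


def query(server, name, qtype=1, timeout=3.0):
    """Send the query over UDP port 53 and return the decoded header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, 53))
        response, _ = sock.recvfrom(4096)
    return parse_header(response)


# Example, to be run on the affected server (IP and name are placeholders):
#   print(query("<internal IP of our DC>", "<FQDN of our DC>"))
```

Running the same call once with the static IP and once with a DHCP address would show whether the two replies differ in rcode or in the authoritative flag. An NXDOMAIN without the authoritative flag, for a name the DC's own zone should hold, would hint that the reply did not come from the DC's zone data, though that is only a hint and not proof. Passing `qtype=33` (SRV) with a name such as `_ldap._tcp.dc._msdcs.<domain>` would exercise the kind of lookup that domain controller discovery depends on.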
I wondered about potential firewall issues internal to the Gateway, but tests of "PortQry.exe -n 10.1.10.116 -p UDP -e 53" and "PortQry.exe -n 10.1.10.37 -e 53" both come back with the LISTENING response, which I understand implies that port 53 on the internal DCs is accessible to the public servers assigned static IPs, and I believe port 53 (UDP and TCP) is what DNS uses. If I change the software firewall on the DCs to block all incoming traffic, then PortQry times out as expected. And if I completely turn off the software firewalls on both the public server and the internal DCs, there is no change in behavior: PortQry still works, but DNS and Active Directory integration still fail. There are no other firewall elements in our network, other than whatever the Gateway may be doing. We have the Gateway's "Disable Firewall for True Static IP Subnet Only" checked, but I assume that is for the outside world reaching into the static IP assigned servers. And we have the Gateway general Firewall Security Level = Typical Security (Medium).
I have been ignoring IPv6, and have the IPv6 network adapters turned off on all the servers. Nothing has changed on our side regarding that, and so I haven't been pursuing any investigation there.
About a year ago, I feel like we had a somewhat similar problem that I struggled to diagnose. In the end, Comcast swapped out an older generation Gateway for their latest & greatest version, and everything just magically started working again. As a result, I had Comcast swap out our Gateway again, and it just went from a CGA4131COM to a CGA4332COM last Friday, but that didn't fix anything. The new one looks identical to the old one, and I wonder if it is running the same firmware, whereas the one that "failed" about a year ago was a very different gateway, so the swap back then likely brought a more significant change.
I'm starting to really pound my head against the wall on this, and I crossed my fingers that swapping out the Gateway would fix it, because I don't think we changed anything on our end. But no joy. Any chance anyone has battled this? Or is the configuration I described not apt to work in the first place, and I just got lucky over the past 2 years? Moving the DCs (or 1 DC) to a static IP doesn't feel like a great solution to me, but I guess if that was a strong recommendation from someone, I could consider that.
I've held short on bringing Wireshark to bear, or trying to figure out what advanced logging I might be able to enable (or might already be enabled) on the DNS server (DCs) side or client machine side... assuming the DNS resolution problem is the right one to be focusing on.
Thanks for taking the time to read this!
11 months ago
Hi there, @pjaffe!
Thanks so much for taking the time to write out all that information, and for all the troubleshooting you've already attempted! It's so helpful for us to know what steps have been tried. I really, really appreciate you!
We can check things on our side to see if there's anything related to the static IPs and/or the gateway that we can fix, and we'd be glad to double-check :)
If you're up for that, please send over a direct message using the following instructions: