
Building a Network Configuration Linter with Batfish

Using Batfish to validate network configurations before deployment — catching routing loops, unreachable subnets, and policy violations without touching a live device.

The Case for Pre-Deployment Validation

Code has linters. Infrastructure as code has terraform validate. But network configurations? Most teams still validate by deploying to a lab and manually checking. Or worse, deploying to production and hoping.

Batfish changes this. It is an open-source network configuration analysis tool that builds a model of your network from config files and answers questions about it — reachability, routing, ACLs, BGP sessions — without needing a running network.

Setting Up Batfish

Batfish runs as a Docker container with a Python client:

$ docker run -d -p 9997:9997 -p 9996:9996 batfish/allinone
$ pip install pybatfish

from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints

bf = Session(host="localhost")
bf.set_network("production")
bf.init_snapshot("/path/to/configs", name="candidate")

Point it at a directory of router configs and it parses them into a network model. Cisco IOS, IOS-XE, Junos, Arista EOS — it handles all the major vendors.
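Parsing is not all-or-nothing, so it helps to check how each file fared before trusting later answers. The sketch below assumes the rows come from `bf.q.fileParseStatus().answer().frame()` and that its `File_Name` and `Status` columns carry values such as `PASSED` and `PARTIALLY_UNRECOGNIZED`. The helper name `parse_problems` and the sample file names are made up for illustration.

```python
def parse_problems(rows):
    """Return (file, status) pairs for every file that did not parse cleanly.

    `rows` is any iterable of mappings with "File_Name" and "Status" keys,
    for example frame.to_dict("records") on the fileParseStatus answer
    (assumed column names; check them against your pybatfish version).
    """
    problems = []
    for row in rows:
        if row["Status"] != "PASSED":
            problems.append((row["File_Name"], row["Status"]))
    return problems


# Stand-in rows so the sketch runs without a Batfish server.
sample = [
    {"File_Name": "configs/core1.cfg", "Status": "PASSED"},
    {"File_Name": "configs/edge1.cfg", "Status": "PARTIALLY_UNRECOGNIZED"},
]
for file_name, status in parse_problems(sample):
    print(f"{file_name}: {status}")
```

A partially recognized file is not necessarily a blocker, but any lint result that depends on the unrecognized lines should be treated with suspicion.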

Reachability Checks

The most powerful query is reachability. Given a source and destination, can traffic flow?

result = bf.q.reachability(
    pathConstraints=PathConstraints(
        startLocation="@connectedTo(10.1.0.0/24)"  # interfaces attached to the source subnet
    ),
    headers=HeaderConstraints(
        dstIps="10.2.0.0/24",
        applications=["https"]
    ),
    actions="SUCCESS,FAILURE"
).answer()

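for row in result.frame().itertuples():
    print(f"Flow: {row.Flow}")
    # Each trace is one path the flow can take; its disposition is the outcome.
    for trace in row.Traces:
        print(f"  Disposition: {trace.disposition}")
        for hop in trace.hops:
            print(f"    Hop: {hop.node}")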

This tells you not just whether the traffic arrives, but the exact path it takes — every hop, every interface, every ACL evaluation. If the traffic is denied, it shows you exactly which ACL line dropped it.
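To turn a reachability answer into a pass/fail check, one option is to flag every trace whose outcome is not a success. This is a minimal sketch, assuming each row exposes `Flow` and a `Traces` list whose items carry a `disposition` string, and that Batfish counts `ACCEPTED`, `DELIVERED_TO_SUBNET`, and `EXITS_NETWORK` as successful. The helper name `failed_flows` and the stand-in rows are hypothetical.

```python
from types import SimpleNamespace

# Dispositions Batfish groups under "SUCCESS" (assumed from its documentation).
SUCCESS_DISPOSITIONS = {"ACCEPTED", "DELIVERED_TO_SUBNET", "EXITS_NETWORK"}


def failed_flows(rows):
    """Return (flow, disposition) pairs for every trace that did not succeed.

    Each row needs a `Flow` attribute and a `Traces` list whose items expose
    a `disposition` string, which is how pybatfish traces are assumed to look.
    """
    failures = []
    for row in rows:
        for trace in row.Traces:
            if trace.disposition not in SUCCESS_DISPOSITIONS:
                failures.append((str(row.Flow), trace.disposition))
    return failures


# Stand-in rows so the sketch runs without a Batfish server.
sample_rows = [
    SimpleNamespace(
        Flow="10.1.0.5 -> 10.2.0.9:443",
        Traces=[SimpleNamespace(disposition="ACCEPTED")],
    ),
    SimpleNamespace(
        Flow="10.1.0.6 -> 10.2.0.9:443",
        Traces=[SimpleNamespace(disposition="DENIED_IN")],
    ),
]
for flow, disposition in failed_flows(sample_rows):
    print(f"{flow}: {disposition}")
```

Against a real snapshot the same function could be fed `result.frame().itertuples()`.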

Building Lint Rules

With Batfish as the engine, we can define lint rules that run against every config change:

def lint_no_default_route_leak(bf):
    """Ensure default routes do not leak between VRFs."""
    routes = bf.q.routes(
        network="0.0.0.0/0",
        protocols="bgp"
    ).answer().frame()

    violations = []
    for _, row in routes.iterrows():
        if row["VRF"] != "default" and row["Next_Hop_IP"] == "0.0.0.0":
            violations.append(
                f"{row['Node']}: default route in VRF {row['VRF']}"
            )
    return violations


def lint_bgp_sessions_established(bf):
    """Verify all configured BGP sessions can establish."""
    sessions = bf.q.bgpSessionStatus().answer().frame()

    violations = []
    for _, row in sessions.iterrows():
        if row["Established_Status"] != "ESTABLISHED":
            violations.append(
                f"{row['Node']}: BGP session to {row['Remote_Node']} "
                f"is {row['Established_Status']}"
            )
    return violations


def lint_unused_acls(bf):
    """Find ACLs that are defined but not applied to any interface."""
    refs = bf.q.unusedStructures().answer().frame()

    violations = []
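    # Structure type names are vendor-specific (for example
    # "extended ipv4 access-list" or "firewall filter"), so match several keywords.
    acl_keywords = ("acl", "access-list", "firewall filter")
    for _, row in refs.iterrows():
        structure_type = row["Structure_Type"].lower()
        if any(keyword in structure_type for keyword in acl_keywords):
            violations.append(
                f"{row['Source_Lines']}: unused ACL '{row['Structure_Name']}'"
            )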
    return violations
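Since every rule takes the session and returns a list of violation strings, a small runner can execute them all and group the findings. This is one possible shape, not the only one: `run_lint_rules` and the two stand-in rules are hypothetical names, and reporting a crashing rule as a violation is a design choice made here so that one broken query does not hide the other results.

```python
def run_lint_rules(bf, rules):
    """Run each rule against the session and collect violations per rule.

    A rule that raises is reported as a violation instead of aborting the
    run, so one broken query does not hide the results of the others.
    """
    report = {}
    for rule in rules:
        try:
            violations = rule(bf)
        except Exception as exc:  # surface query errors as findings
            violations = [f"rule crashed: {exc}"]
        if violations:
            report[rule.__name__] = violations
    return report


# Stand-in rules so the sketch runs without a Batfish server.
def lint_always_clean(bf):
    return []


def lint_always_dirty(bf):
    return ["r1: example violation"]


report = run_lint_rules(None, [lint_always_clean, lint_always_dirty])
for name, violations in report.items():
    for violation in violations:
        print(f"[{name}] {violation}")
```

With a live session, the list would hold the real rules instead, for example `[lint_no_default_route_leak, lint_bgp_sessions_established, lint_unused_acls]`.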

CI Pipeline Integration

The lint rules plug into a CI pipeline. When an engineer opens a pull request with config changes, the pipeline:

  1. Spins up a Batfish container
  2. Loads the candidate configs
  3. Runs all lint rules
  4. Posts results as PR comments
  5. Blocks merge on any critical violations

# .gitlab-ci.yml
config-lint:
  stage: validate
  image: python:3.11
  services:
    - batfish/allinone
  script:
    - pip install pybatfish
    - python scripts/lint_configs.py configs/
  rules:
    - changes:
        - configs/**/*
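The contents of `scripts/lint_configs.py` are not shown here, so the following is only a guess at its shape: load the snapshot, run the rules, and return a non-zero exit code so the job fails and the merge is blocked. The `BATFISH_HOST` variable, the `make_session` parameter, and the network name `ci` are assumptions. In GitLab CI a service container is typically reached by a hostname derived from its image name rather than `localhost`, which is why the host is read from the environment.

```python
import os


def main(argv, make_session, rules):
    """Lint the snapshot directory in argv[1]; return a process exit code.

    make_session(host) must return an object with the pybatfish Session
    methods used in this post (set_network, init_snapshot).
    """
    if len(argv) != 2:
        print("usage: lint_configs.py <config-dir>")
        return 2
    # BATFISH_HOST is a hypothetical variable naming the Batfish service host.
    bf = make_session(os.environ.get("BATFISH_HOST", "localhost"))
    bf.set_network("ci")
    bf.init_snapshot(argv[1], name="candidate", overwrite=True)

    failed = False
    for rule in rules:
        for violation in rule(bf):
            print(f"[{rule.__name__}] {violation}")
            failed = True
    # A non-zero exit code fails the CI job.
    return 1 if failed else 0
```

In the real script the last line would be something like `sys.exit(main(sys.argv, lambda host: Session(host=host), rules))`, with `rules` holding the lint functions defined earlier. Passing the session factory in keeps the function testable without a running container.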

What We Catch

In the first month of running the linter, we caught:

  • 3 ACL misconfigurations — rules that would have blocked legitimate traffic
  • 1 OSPF area mismatch — two interfaces in different areas that should have been in area 0
  • 2 unused ACLs — leftover from decommissioned services, now cleaned up
  • 1 BGP route leak — a VRF was importing routes from a route target it should not have been

Apart from the unused ACLs, each of these would likely have caused a production incident. The linter took about a week to set up. The math is clear.

Limitations

Batfish models what the configurations say, not what the hardware does. It simulates the control plane and derives forwarding behavior from it, but it cannot simulate hardware TCAM limits, queuing behavior, or timing-dependent issues. It also requires complete configs: if your TACACS server pushes dynamic ACLs, Batfish will not see them.

For what it does cover — routing correctness, ACL evaluation, BGP policy — it is the best tool available. Think of it as show ip route for configurations that have not been deployed yet.
