.claude/skills/terway-e2e/SKILL.md
Run Terway E2E tests on an ACK cluster. Covers the full workflow: build image → create ACK cluster via Terraform → deploy Terway → run e2e tests → cleanup.

Use this skill when:
- Running any Terway E2E tests (Prefix, Connectivity, PodNetworking, SecurityGroup, MultiNetwork, etc.)
- Creating ACK clusters for Terway testing
- Building and deploying Terway images
- Validating Terway features or troubleshooting E2E test failures
- The user mentions "e2e", "terway test", "ack cluster", "prefix test", or any Terway feature validation
Complete workflow for running Terway end-to-end tests on ACK clusters.
Terway is Alibaba Cloud's CNI plugin for Kubernetes. E2E tests validate:
Registry: Replace `registry.cn-hangzhou.aliyuncs.com/acs` in all examples below with your own image registry namespace (e.g., `registry.cn-hangzhou.aliyuncs.com/<your-namespace>`). Make sure you have push access to that namespace before building.
# Record the image tag (git short commit)
IMAGE_TAG=$(git rev-parse --short HEAD)
echo "Image tag: $IMAGE_TAG"
# If running directly on the Linux dev server:
cd /root/opensource_terway
make BUILD_PLATFORMS=linux/amd64,linux/arm64 \
REGISTRY=registry.cn-hangzhou.aliyuncs.com/acs \
build-push-terway build-push-terway-controlplane
# If building from macOS/remote, sync first then build:
PROJECT_ROOT=$(git rev-parse --show-toplevel)
PROJECT_NAME=$(basename "$PROJECT_ROOT")
rsync -ar --exclude=bin/ --delete "$PROJECT_ROOT/" root@dev:"/root/$PROJECT_NAME/"
ssh dev "bash -l -c 'cd /root/$PROJECT_NAME && make BUILD_PLATFORMS=linux/amd64,linux/arm64 REGISTRY=registry.cn-hangzhou.aliyuncs.com/acs build-push-terway build-push-terway-controlplane'"
CRITICAL (Registry Endpoint): Always use the public registry endpoint `registry.cn-hangzhou.aliyuncs.com` (NOT the VPC endpoint `registry-vpc.cn-hangzhou.aliyuncs.com`). The VPC endpoint causes `dial tcp ...: i/o timeout` from dev servers that are not inside the same VPC as the registry. The public endpoint works from any ECS in the same region.
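
The endpoint rule above can be enforced before a build starts. The following is a minimal sketch, not something shipped in the Terway repository: `check_registry_endpoint` is a hypothetical helper name, and it only inspects the registry string for the `registry-vpc.` prefix.

```shell
# Hypothetical pre-flight check (assumption: not part of the Terway repo).
# Returns non-zero when the registry string points at a VPC endpoint.
check_registry_endpoint() {
  case "$1" in
    registry-vpc.*)
      echo "VPC registry endpoint not allowed here: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

A possible use is `check_registry_endpoint "$REGISTRY" || exit 1` ahead of the `make` invocation, where `REGISTRY` holds the same value passed to `make`.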
cd hack/terraform/ack/
# Initialize (first time only)
terraform init
# Preview changes
terraform plan
# Create cluster (takes 10-20 minutes)
terraform apply -auto-approve
# Verify kubeconfig was generated
ls kubeconfig-tf-ack-*
export KUBECONFIG=$(ls kubeconfig-tf-ack-* | head -1)
# Verify cluster is accessible
kubectl get nodes
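
If `terraform apply` has been run more than once, several `kubeconfig-tf-ack-*` files may exist, or none if the apply failed. As a sketch under that assumption, the hypothetical helper `first_kubeconfig` below picks the first match in sorted order (the same file `ls ... | head -1` would select) and fails loudly when nothing matches.

```shell
# Hypothetical helper (assumption: not part of the Terway repo).
# Prints the first kubeconfig-tf-ack-* file in the given directory,
# or returns non-zero if none exists.
first_kubeconfig() {
  dir=${1:-.}
  for f in "$dir"/kubeconfig-tf-ack-*; do
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no kubeconfig-tf-ack-* file found in $dir" >&2
  return 1
}
```

For example, `export KUBECONFIG=$(first_kubeconfig "$PWD")` from inside `hack/terraform/ack/` yields an absolute path that stays valid after changing directories.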
The Terraform configuration creates:
- IP Prefix node pool: nodes labeled `k8s.aliyun.com/ip-prefix=true` and `k8s.aliyun.com/ignore-by-terway=true` for dedicated prefix E2E testing. Initially isolated from Terway scheduling; the E2E test removes the ignore label to bring them online.

IP Prefix Node Pool: The `ip-prefix` node pool starts with `k8s.aliyun.com/ignore-by-terway=true` so Terway does not schedule onto these nodes at cluster creation time. The E2E test setup flow (`setupIPPrefixNodes`) removes this label and waits for Terway to become ready before running tests. Teardown restores the label.
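
The E2E tests toggle the ignore label programmatically. When debugging by hand, the equivalent `kubectl label` commands look roughly like the sketch below. `ignore_label_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running; it is not how `setupIPPrefixNodes` is implemented.

```shell
IGNORE_LABEL="k8s.aliyun.com/ignore-by-terway"

# Hypothetical helper (assumption: not part of the Terway repo).
# Prints, without executing, the kubectl command that removes or
# restores the ignore label on one node.
ignore_label_cmd() {
  action=$1
  node=$2
  case "$action" in
    remove)  echo "kubectl label node $node ${IGNORE_LABEL}-" ;;
    restore) echo "kubectl label node $node ${IGNORE_LABEL}=true --overwrite" ;;
    *)       echo "usage: ignore_label_cmd remove|restore <node>" >&2; return 1 ;;
  esac
}
```

Running `ignore_label_cmd remove <node-name>` prints the command to bring a node online; `restore` prints the command that isolates it again.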
Key configuration files:
- `terraform.tfvars`: Cluster configuration (region, instance types, CIDRs)
- `terraform_e2e.tf`: Main Terraform configuration

cd hack/terraform/ack/
# Deploy with ENIMultiIP mode (required for Prefix/Trunk)
# --pull-policy Always ensures nodes re-pull the image on every restart (recommended for test builds)
./deploy-terway.sh \
--tag <IMAGE_TAG> \
--registry registry.cn-hangzhou.aliyuncs.com/acs \
--daemon-mode ENIMultiIP \
--pull-policy Always
# Or with Datapath V2 (kube-proxy replacement)
./deploy-terway.sh \
--tag <IMAGE_TAG> \
--registry registry.cn-hangzhou.aliyuncs.com/acs \
--daemon-mode ENIMultiIP \
--pull-policy Always \
--enable-dp-v2 \
--enable-network-policy
# Verify all pods are Running
kubectl get pods -n kube-system | grep -E 'terway'
# Expected:
# terway-controlplane-xxx 2/2 Running
# terway-eniip-xxx 2/2 Running (one per node)
Note: `deploy-terway.sh` reads cluster metadata (clusterID, VPC, security group, vSwitches) automatically from `terraform.tfstate`. No manual values.yaml editing needed. The generated `terway-values.yaml` is overwritten on each run.
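
The "all pods are Running" check after deployment can be scripted. The sketch below assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); `terway_pods_ready` is a hypothetical helper that reads that output on stdin and succeeds only if at least one terway pod is listed and every terway pod is fully ready and Running.

```shell
# Hypothetical helper (assumption: not part of the Terway repo).
# Reads `kubectl get pods` output on stdin. Exit status 0 means every
# line mentioning "terway" has READY n/n and STATUS Running, and at
# least one such line was seen.
terway_pods_ready() {
  awk '
    /terway/ {
      seen = 1
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

Typical use would be `kubectl get pods -n kube-system | terway_pods_ready && echo "terway is ready"`.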
| Option | Description |
|---------------------------|------------------------------------------------------------|
| --daemon-mode | ENIMultiIP (shared ENI) or ENIDirectIP (exclusive ENI) |
| --ip-stack | ipv4, ipv6, or dual |
| --enable-dp-v2 | Enable Datapath V2 (kube-proxy replacement) |
| --enable-network-policy | Enable eBPF-based network policy |
| Test | File | Description |
|-------------------|-------------------------------------------|-----------------------------------------------|
| Prefix | prefix_basic_test.go, prefix_boundary_test.go, prefix_helpers_test.go | Prefix allocation, boundary conditions, scale |
| Prefix E2E | prefix_e2e_test.go | End-to-end Pod IP validation on dedicated prefix node pool |
| Connectivity | connective_test.go, connectivity_*.go | Pod-to-Pod, Service connectivity |
| PodNetworking | pod_networking_test.go | Custom network via CRD |
| SecurityGroup | security_group_test.go | SG isolation and rules |
| MultiNetwork | multi_network_test.go | Multiple ENIs per pod |
| Upgrade | upgrade_test.go | Upgrade compatibility |
| ERDMA | erdma/erdma_test.go | ERDMA device support |
# Set kubeconfig (e2e tests read ~/.kube/config by default)
cp hack/terraform/ack/kubeconfig-tf-ack-* ~/.kube/config
# OR (the glob is not expanded in a plain assignment, so resolve it first):
# export KUBECONFIG=$(ls "$PWD"/hack/terraform/ack/kubeconfig-tf-ack-* | head -1)
# Run all prefix tests
cd /root/opensource_terway
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix' -region-id cn-hangzhou
# Run specific prefix suites
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_Basic' -region-id cn-hangzhou
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_Boundary' -region-id cn-hangzhou
# Run connectivity tests
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestConnectivity' -region-id cn-hangzhou \
-repo registry.cn-hangzhou.aliyuncs.com/acs
IMPORTANT: Use `./tests` (not `./tests/...`) because the e2e tests live directly in the `tests` package. The `-region-id` flag is required for prefix tests (used to query ECS API). The test binary reads `~/.kube/config` automatically.
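
Because every invocation repeats the same flags, a small wrapper can keep them consistent. This is a sketch only: `run_e2e` and the environment variables `REGION_ID`, `E2E_TIMEOUT`, `E2E_REPO`, and `DRY_RUN` are hypothetical names introduced here, while the `go test` flags are the ones used in the commands above. It must be run from the repository root so that `./tests` resolves.

```shell
# Hypothetical wrapper (assumption: not part of the Terway repo).
# Builds the go test command used throughout this document.
# With DRY_RUN set, the command is printed instead of executed.
run_e2e() {
  pattern=$1
  region=${REGION_ID:-cn-hangzhou}
  timeout=${E2E_TIMEOUT:-60m}
  set -- go test -v -count=1 -timeout "$timeout" -tags e2e ./tests \
    -run "$pattern" -region-id "$region"
  # -repo is only appended when E2E_REPO is set (e.g. for connectivity tests).
  if [ -n "${E2E_REPO:-}" ]; then
    set -- "$@" -repo "$E2E_REPO"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 run_e2e 'TestPrefix_Basic'` prints the command for review, and `E2E_TIMEOUT=120m run_e2e 'TestPrefix_State|TestPrefix_Scale'` runs the longer suites.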
# Basic prefix allocation
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_Basic' -region-id cn-hangzhou
# Boundary conditions (max prefixes, exhaustion)
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_Boundary' -region-id cn-hangzhou
# Dual stack and EFLO tests
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_DualStack|TestPrefix_EFLO' -region-id cn-hangzhou
# State machine and scale tests
go test -v -count=1 -timeout 120m -tags e2e ./tests \
-run 'TestPrefix_State|TestPrefix_Scale' -region-id cn-hangzhou
# Prefix E2E tests (dedicated node pool, end-to-end pod IP validation)
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestPrefix_E2E' -region-id cn-hangzhou
| Instance | ENI adapters | Prefixes/ENI | Max prefix capacity |
|--------------|--------------|--------------|---------------------|
| g7ne.2xlarge | 6 | 14 | 84 |
| g7ne.4xlarge | 8 | 14 | 112 |
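
The last column of the table is the product of the two before it. A one-line sketch reproduces the numbers; `max_prefix_capacity` is a hypothetical helper name, and the calculation assumes every listed ENI adapter can carry the full per-ENI prefix count, as the table implies.

```shell
# Hypothetical helper (assumption: not part of the Terway repo).
# max prefix capacity = ENI adapters * prefixes per ENI
max_prefix_capacity() {
  echo $(( $1 * $2 ))
}
```

`max_prefix_capacity 6 14` gives the g7ne.2xlarge value and `max_prefix_capacity 8 14` the g7ne.4xlarge value.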
Cleanup timeout warnings (resetNodePrefixState): Between tests, the 3-minute cleanup
wait sometimes expires while Deleting prefixes are still being unassigned from ENIs via
the ECS API. This is expected — tests emit a warning and proceed. The underlying
IPPrefixStatusDeleting controller handles the ECS API call asynchronously.
TestPrefix_Boundary_MaxValue: This test requests more prefixes than node capacity
(e.g., 134 requested, 84 max). The test verifies no over-allocation occurs. If the system
hasn't reached max capacity within the wait window, the test accepts any count ≤ maxCapacity
and still passes.
TestPrefix_Boundary_ENICapacity: Requests 17 prefixes when max-per-ENI is 14, verifying
the system spreads them across 2+ ENIs.
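
The arithmetic behind the two boundary tests can be sketched as follows. Both function names are hypothetical and exist only for illustration: one computes the minimum number of ENIs needed for a prefix count (ceiling division), the other the upper bound a test should accept when more prefixes are requested than the node can hold.

```shell
# Hypothetical helpers (assumption: not part of the Terway repo).

# Minimum ENIs needed to hold <requested> prefixes at <per_eni> each:
# ceil(requested / per_eni), using integer arithmetic.
min_enis_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Upper bound on allocated prefixes: min(requested, capacity).
max_allocatable() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}
```

With the numbers from the text, `min_enis_needed 17 14` yields 2 ENIs, and `max_allocatable 134 84` yields 84, the ceiling that `TestPrefix_Boundary_MaxValue` checks against.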
# All connectivity tests
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestConnectivity' -region-id cn-hangzhou \
-repo registry.cn-hangzhou.aliyuncs.com/acs
# Specific scenarios
go test -v -count=1 -timeout 60m -tags e2e ./tests \
-run 'TestConnectivity_NodePort' -region-id cn-hangzhou \
-repo registry.cn-hangzhou.aliyuncs.com/acs
When tests fail or when validating feature implementations, analyze both the test code and the implementation to discover issues.
Test Prerequisites: Check if tests verify cluster configuration correctly
// Example: Check IPAM type
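func checkIPAMType(t *testing.T) {
	// client and ctx are assumed to come from the surrounding test setup.
	cm := &corev1.ConfigMap{}
	if err := client.Resources().Get(ctx, "eni-config", "kube-system", cm); err != nil {
		t.Fatalf("failed to get eni-config: %v", err)
	}
	// Verify ipam_type is "crd" for centralized IPAM
}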
Node Selection: Tests should use proper node affinity
// Tests need correct node type selection
nodeInfo, _ := DiscoverNodeTypesWithCapacity(ctx, client)
if len(nodeInfo.ECSSharedENINodes) == 0 {
t.Skip("No ECS Shared ENI nodes available")
}
Resource Cleanup: Verify tests clean up resources properly
Version Requirements: Some features require specific Terway versions
// Example from tests
if !RequireTerwayVersion("v1.17.0") {
t.Skip("Requires Terway >= v1.17.0")
}
Cluster Configuration:
- Centralized IPAM (`ipam_type: crd`) required for Prefix

Node Labels:
- `terway-config: <configmap-name>` for Dynamic Config
- `ip-prefix: "true"` for prefix mode
- `k8s.aliyun.com/exclusive-mode-eni-type` for exclusive ENI

| Issue | Cause | Solution |
|------------------------------------------------|------------------------------------------------------|---------------------------------------------------------------------------------------|
| dial tcp ...: i/o timeout during push | Using VPC registry endpoint from wrong network | Switch to public endpoint: registry.cn-hangzhou.aliyuncs.com (NOT registry-vpc.*) |
| Tests skip with "ipam type is not crd" | Cluster uses non-centralized IPAM | Ensure ipam_type: crd in eni-config |
| Tests skip with "No ECS Shared ENI nodes" | Only Lingjun/EFLO nodes present | Add standard ECS nodes |
| Prefix not allocated | Missing node labels | Add ip-prefix: "true" and terway-config labels |
| Pod stuck in Creating | ENI quota exceeded | Check ENI quota or use prefix mode |
| Connectivity failures | Security group rules | Check SG allows pod traffic |
| resetNodePrefixState cleanup timeout warning | Deleting prefixes take >3m to unassign via ECS API | Expected behavior — test continues; no fix needed |
| Nodes in NotReady state after cluster creation | CNI not yet installed | Deploy Terway via deploy-terway.sh first |
Check Pod Status:
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
Check Terway Logs:
kubectl logs -n kube-system <terway-eniip-pod> -c terway
kubectl logs -n kube-system <terway-controlplane-pod>
Check CR Status:
kubectl get podeni -A
kubectl describe podeni <name> -n <namespace>
kubectl get podnetworking
kubectl describe podnetworking <name>
Check Node Resources:
kubectl describe node <node-name> | grep -A 10 "Allocatable"
# Check for aliyun/eni, aliyun/member-eni resources
cd hack/terraform/ack/
terraform destroy -auto-approve
| File | Purpose |
|---------------------------------------|-------------------------------------------|
| hack/terraform/ack/ | Terraform configs for ACK cluster |
| hack/terraform/ack/deploy-terway.sh | Helm-based Terway deployment script |
| tests/main_test.go | Test initialization and cluster discovery |
| tests/utils_test.go | Test utilities (Pod, Service builders) |
| tests/node_utils_test.go | Node type discovery and capacity checks |
| tests/prefix_*.go | Prefix mode tests |
| tests/connective*.go | Connectivity tests |
| tests/pod_networking_test.go | PodNetworking CRD tests |
| tests/security_group_test.go | Security group tests |
| types/daemon/dynamicconfig.go | Dynamic Config implementation |
| Type | Label | Description |
|-------------------------|--------------------------------------------------|-------------------------------------|
| ecs-shared-eni | (default) | Standard ECS nodes, shared ENI mode |
| ecs-exclusive-eni | k8s.aliyun.com/exclusive-mode-eni-type=eniOnly | ECS nodes, exclusive ENI |
| ecs-ip-prefix | k8s.aliyun.com/ip-prefix=true | ECS nodes, dedicated IP Prefix E2E pool |
| lingjun-shared-eni | alibabacloud.com/lingjun-worker=true | Lingjun/EFLO nodes, shared ENI |
| lingjun-exclusive-eni | alibabacloud.com/lingjun-worker=true and k8s.aliyun.com/exclusive-mode-eni-type=eniOnly | Lingjun nodes, exclusive ENI |
| Flag | Default | Description |
|-----------------------|------------------------------------------------|-------------------------------|
| -repo | registry.cn-hangzhou.aliyuncs.com/build-test | Image registry |
| -timeout | 2m | Default test timeout |
| -enable-trunk | true | Enable trunk mode tests |
| -region-id | cn-hangzhou | Alibaba Cloud region |
| -vswitch-ids | "" | Additional vSwitch IDs |
| -security-group-ids | "" | Additional security group IDs |