skills/scientific-computing/networkx-graph-analysis/SKILL.md
Graph and network analysis toolkit. Four graph classes (Graph, DiGraph, MultiGraph, MultiDiGraph), centrality, shortest paths, community detection, generators, I/O (GraphML, GML, edge list), matplotlib viz. For large graphs (100K+ nodes) use igraph or graph-tool; for GNNs use PyG.
npx skillsauth add jaechang-hits/scicraft networkx-graph-analysis
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
NetworkX is a Python library for creating, manipulating, and analyzing complex networks and graphs. It provides data structures for undirected, directed, and multi-edge graphs along with a comprehensive collection of graph algorithms, generators, and I/O utilities. Use NetworkX when working with relationship data in social networks, biological interaction networks, transportation systems, citation graphs, or any domain involving pairwise entity relationships.
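Alternatives:
- Large graphs (100K+ nodes): igraph or graph-tool instead
- graph-tool with OpenMP or cuGraph
- Graph neural networks: torch-geometric-graph-neural-networks
Prerequisites:
- Python packages: networkx, matplotlib, scipy, pandas, numpy
- Optional: pydot or pygraphviz (Graphviz layouts)
- Install: pip install networkx matplotlib scipy pandas numpy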
import networkx as nx
# Load Zachary's karate club graph (a built-in example dataset)
G = nx.karate_club_graph()
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
# Nodes: 34, Edges: 78
# Compute centrality and find most central node
bc = nx.betweenness_centrality(G)
top_node = max(bc, key=bc.get)
print(f"Most central node: {top_node}, betweenness: {bc[top_node]:.3f}")
# Detect communities
from networkx.algorithms import community
comms = community.greedy_modularity_communities(G)
print(f"Communities found: {len(comms)}")
import networkx as nx
# Undirected graph (most common)
G = nx.Graph()
G.add_node("protein_A", type="kinase", weight=1.5)
G.add_nodes_from(["protein_B", "protein_C"])
G.add_edge("protein_A", "protein_B", weight=0.9, interaction="phosphorylation")
G.add_edges_from([("protein_B", "protein_C"), ("protein_A", "protein_C")])
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
# Nodes: 3, Edges: 3
# Directed graph (gene regulation, citations)
D = nx.DiGraph()
D.add_edges_from([("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneA")])
print(f"TF1 out-degree: {D.out_degree('TF1')}") # 2
# MultiGraph (multiple relationship types between same nodes)
M = nx.MultiGraph()
M.add_edge("A", "B", key="binding", affinity=0.8)
M.add_edge("A", "B", key="regulation", effect="inhibition")
print(f"Edges between A-B: {M.number_of_edges('A', 'B')}") # 2
import networkx as nx
G = nx.karate_club_graph()
# Query structure
print(f"Degree of node 0: {G.degree(0)}")
print(f"Neighbors of node 0: {list(G.neighbors(0))[:5]}")
print(f"Has edge 0-1: {G.has_edge(0, 1)}")
# Set and get attributes
G.nodes[0]["role"] = "instructor"
nx.set_node_attributes(G, {0: "high", 33: "high"}, "importance")
G[0][1]["weight"] = 0.95
# Iterate with data
for u, v, data in G.edges(data=True):
    if "weight" in data:
        print(f" Edge {u}-{v}: weight={data['weight']}")
        break
# Subgraphs (returns read-only view; use .copy() for mutable)
H = G.subgraph([0, 1, 2, 3, 4, 5]).copy()
print(f"Subgraph: {H.number_of_nodes()} nodes, {H.number_of_edges()} edges")
import networkx as nx
G = nx.karate_club_graph()
degree_c = nx.degree_centrality(G)
# Note: path-based measures treat the weight attribute as a distance (cost), not as tie strength
between_c = nx.betweenness_centrality(G, weight="weight")
# For large graphs, approximate: nx.betweenness_centrality(G, k=100)
close_c = nx.closeness_centrality(G)
eigen_c = nx.eigenvector_centrality(G, max_iter=1000)
pr = nx.pagerank(G, alpha=0.85)
# Compare top nodes across measures
for name, metric in [("Degree", degree_c), ("Betweenness", between_c),
                     ("Closeness", close_c), ("PageRank", pr)]:
    top = max(metric, key=metric.get)
    print(f"{name:12s}: top node={top}, score={metric[top]:.4f}")
import networkx as nx
G = nx.karate_club_graph()
# Shortest path
path = nx.shortest_path(G, source=0, target=33)
length = nx.shortest_path_length(G, source=0, target=33)
print(f"Shortest path 0->33: {path} (length {length})")
print(f"Average shortest path length: {nx.average_shortest_path_length(G):.3f}")
# Connected components
print(f"Connected: {nx.is_connected(G)}")
components = list(nx.connected_components(G))
print(f"Components: {len(components)}, largest: {len(max(components, key=len))}")
# For directed graphs: strong/weak connectivity
D = nx.DiGraph([(0,1),(1,2),(2,0),(3,4)])
print(f"Strongly connected: {list(nx.strongly_connected_components(D))}")
# Connectivity measures
print(f"Node connectivity: {nx.node_connectivity(G)}")
print(f"Edge connectivity: {nx.edge_connectivity(G)}")
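The shortest-path calls above count hops. As a minimal sketch of how the weight parameter changes the answer, the toy graph below (node names and weights are made up for illustration) has a direct but expensive edge and a cheaper two-hop route.

```python
import networkx as nx

# Toy graph: a direct a-c edge that costs more than the detour through b.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 5.0)])

hops = nx.shortest_path(G, "a", "c")                        # fewest edges
cheapest = nx.shortest_path(G, "a", "c", weight="weight")   # lowest total weight
cost = nx.shortest_path_length(G, "a", "c", weight="weight")

print(f"Fewest hops: {hops}")
print(f"Lowest weight: {cheapest} (cost {cost})")
```

Without weight the direct edge wins; with weight="weight" the path goes through b, because the attribute is summed as a cost along the path.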
Partition networks into densely connected groups.
import networkx as nx
from networkx.algorithms import community
import itertools
G = nx.karate_club_graph()
# Greedy modularity maximization
comms_greedy = community.greedy_modularity_communities(G)
mod_score = community.modularity(G, comms_greedy)
print(f"Greedy: {len(comms_greedy)} communities, modularity={mod_score:.4f}")
# Label propagation (fast; this semi-synchronous variant takes no seed, see asyn_lpa_communities for the seeded asynchronous variant)
comms_lpa = community.label_propagation_communities(G)
print(f"Label propagation: {len(list(comms_lpa))} communities")
# Girvan-Newman (hierarchical, edge betweenness removal)
gn = community.girvan_newman(G)
# Get first level of partition
first_level = next(gn)
print(f"Girvan-Newman first split: {len(first_level)} groups")
print(f" Sizes: {[len(c) for c in first_level]}")
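Community functions return groups of nodes, while plotting and export usually want a per-node label. The sketch below shows one possible way to turn a partition into a node-to-community mapping and store it as a node attribute; the attribute name "community" is an arbitrary choice for this example.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()
comms = community.greedy_modularity_communities(G)

# Map each node to the index of the community it belongs to.
membership = {node: i for i, comm in enumerate(comms) for node in comm}

# Store the label on the graph so it travels with exports such as GraphML.
nx.set_node_attributes(G, membership, "community")

print(f"Node 0 is in community {G.nodes[0]['community']}")
print(f"Valid partition: {community.is_partition(G, comms)}")
```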
import networkx as nx
import pandas as pd
import json
G = nx.karate_club_graph()
# Edge list (simple text format)
nx.write_edgelist(G, "karate.edgelist")
G_loaded = nx.read_edgelist("karate.edgelist", nodetype=int)
# GraphML (preserves all attributes, XML-based)
nx.write_graphml(G, "karate.graphml")
G_xml = nx.read_graphml("karate.graphml")
# JSON (node-link format, web-friendly for d3.js)
data = nx.node_link_data(G)
with open("karate.json", "w") as f:
    json.dump(data, f)
# Pandas integration
df = pd.DataFrame({"source": [1,2,3], "target": [2,3,4], "weight": [0.5,1.0,0.75]})
G_pd = nx.from_pandas_edgelist(df, "source", "target", edge_attr="weight")
df_out = nx.to_pandas_edgelist(G_pd)
print(f"Pandas round-trip: {len(df_out)} edges")
# NumPy/SciPy matrices
A = nx.to_numpy_array(G)
print(f"Adjacency matrix shape: {A.shape}")
A_sparse = nx.to_scipy_sparse_array(G, format="csr") # Memory-efficient
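One detail worth checking after a GraphML round trip is the node ID type. The sketch below, which writes to a temporary directory purely for illustration, compares the default read with a read that passes node_type=int.

```python
import os
import tempfile

import networkx as nx

G = nx.karate_club_graph()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "karate.graphml")
    nx.write_graphml(G, path)
    G_str = nx.read_graphml(path)                 # node IDs come back as strings
    G_int = nx.read_graphml(path, node_type=int)  # cast node IDs back to int

print(f"Default read, first IDs: {sorted(G_str.nodes())[:3]}")
print(f"node_type=int, first IDs: {sorted(G_int.nodes())[:3]}")
print(f"Attribute preserved: club of node 0 = {G_int.nodes[0]['club']}")
```

If later code indexes nodes with integers, reading with node_type=int avoids KeyError surprises on the reloaded graph.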
import networkx as nx
import matplotlib.pyplot as plt
G = nx.karate_club_graph()
pos = nx.spring_layout(G, seed=42)
# Color by degree, size by betweenness centrality
bc = nx.betweenness_centrality(G)
fig, ax = plt.subplots(figsize=(10, 8))
nx.draw(G, pos=pos, ax=ax,
        node_color=[G.degree(n) for n in G.nodes()], cmap=plt.cm.viridis,
        node_size=[3000 * bc[n] + 100 for n in G.nodes()],
        edge_color="gray", alpha=0.8, with_labels=True, font_size=8)
plt.tight_layout()
plt.savefig("network.png", dpi=300, bbox_inches="tight")
plt.savefig("network.pdf", bbox_inches="tight") # Vector format
print("Saved network.png and network.pdf")
import networkx as nx
# Erdos-Renyi random graph: n nodes, edge probability p
G_er = nx.erdos_renyi_graph(n=200, p=0.05, seed=42)
print(f"ER: {G_er.number_of_nodes()} nodes, {G_er.number_of_edges()} edges")
# Barabasi-Albert scale-free (power-law degree distribution)
G_ba = nx.barabasi_albert_graph(n=200, m=3, seed=42)
# Watts-Strogatz small-world
G_ws = nx.watts_strogatz_graph(n=200, k=6, p=0.1, seed=42)
print(f"WS clustering: {nx.average_clustering(G_ws):.3f}")
# Stochastic block model (community structure)
sizes, probs = [50, 50, 50], [[0.25,0.05,0.02],[0.05,0.35,0.07],[0.02,0.07,0.40]]
G_sbm = nx.stochastic_block_model(sizes, probs, seed=42)
# Built-in datasets and classic graphs
G_karate = nx.karate_club_graph() # Zachary's karate club
G_grid = nx.grid_2d_graph(5, 7) # 2D lattice
G_tree = nx.random_labeled_tree(50, seed=42) # Random tree (NetworkX >= 3.2; nx.random_tree was removed in 3.4)
G_geo = nx.random_geometric_graph(n=100, radius=0.2, seed=42)
# See references/algorithms_generators.md for full generator catalog
| Class | Directed | Multi-edge | Self-loops | Use Case |
|-------|----------|------------|------------|----------|
| Graph | No | No | Yes | Undirected networks: social, PPI |
| DiGraph | Yes | No | Yes | Gene regulation, citations, web |
| MultiGraph | No | Yes | Yes | Multiple relationship types |
| MultiDiGraph | Yes | Yes | Yes | Transportation with routes |
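To make the table concrete, the sketch below feeds the same edge list (a duplicate, a reversed pair, and a self-loop, all invented for illustration) to each class and counts what is kept.

```python
import networkx as nx

# Same input for every class: A-B twice, B-A once, and a self-loop on A.
edges = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "A")]

counts = {}
for cls in (nx.Graph, nx.DiGraph, nx.MultiGraph, nx.MultiDiGraph):
    g = cls()
    g.add_edges_from(edges)
    counts[cls.__name__] = g.number_of_edges()

print(counts)
# {'Graph': 2, 'DiGraph': 3, 'MultiGraph': 4, 'MultiDiGraph': 4}
```

Graph merges all three A-B entries into one edge, DiGraph keeps the two directions apart, and the multi-edge classes keep every entry.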
Attributes are stored as dictionaries at graph, node, and edge levels:
import networkx as nx
G = nx.Graph(name="example") # Graph-level attribute
G.add_node(1, label="hub", weight=1.5) # Node attributes
G.add_edge(1, 2, weight=0.8, type="ppi") # Edge attributes
# Bulk set/get
nx.set_node_attributes(G, {1: "red", 2: "blue"}, "color")
colors = nx.get_node_attributes(G, "color") # {1: 'red', 2: 'blue'}
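Reading attributes back follows the same three levels. A minimal sketch, reusing the attribute names from the example above:

```python
import networkx as nx

G = nx.Graph(name="example")
G.add_node(1, label="hub")
G.add_edge(1, 2, weight=0.8)

graph_attrs = G.graph        # graph-level dict
node_attrs = G.nodes[1]      # attribute dict of node 1
edge_attrs = G.edges[1, 2]   # attribute dict of edge (1, 2)

# Bulk getter for edges returns a dict keyed by (u, v) tuples.
weights = nx.get_edge_attributes(G, "weight")

# Node 2 was created implicitly by add_edge and has no label; use .get() for a default.
default_label = G.nodes[2].get("label", "unlabeled")

print(graph_attrs, node_attrs, edge_attrs, weights, default_label)
```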
| Layout | Function | Best For |
|--------|----------|----------|
| Spring (force-directed) | spring_layout(G, seed=42) | General networks |
| Circular | circular_layout(G) | Regular graphs, cycles |
| Kamada-Kawai | kamada_kawai_layout(G) | Small-medium networks |
| Spectral | spectral_layout(G) | Highlighting clusters |
| Shell (concentric) | shell_layout(G, nlist=[[...],[...]]) | Layered/hierarchical |
| Planar | planar_layout(G) | Planar graphs only |
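Every layout function in the table returns a dict mapping each node to an (x, y) coordinate, which can then be passed as pos to the drawing functions. The sketch below illustrates this with shell_layout; putting the five highest-degree nodes in the inner shell is an assumption made only for this example.

```python
import networkx as nx

G = nx.karate_club_graph()

# shell_layout expects a list of node lists, innermost shell first.
# Illustrative choice: the five highest-degree nodes form the inner shell.
by_degree = sorted(G.nodes(), key=G.degree, reverse=True)
inner, outer = by_degree[:5], by_degree[5:]
pos_shell = nx.shell_layout(G, nlist=[inner, outer])

pos_circ = nx.circular_layout(G)

print(f"Shell layout positions: {len(pos_shell)} nodes")
print(f"Node 0 in circular layout: {pos_circ[0]}")
```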
Goal: Identify influential actors, detect communities, and visualize.
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community
# Step 1: Load network and basic stats
G = nx.karate_club_graph()
print(f"Network: {G.number_of_nodes()} actors, {G.number_of_edges()} ties")
print(f"Density: {nx.density(G):.4f}, Clustering: {nx.average_clustering(G):.4f}")
# Step 2: Identify influential nodes
bc = nx.betweenness_centrality(G)
top_bc = sorted(bc.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 by betweenness:", [(n, f"{s:.3f}") for n, s in top_bc])
# Step 3: Detect communities
comms = community.greedy_modularity_communities(G)
print(f"Communities: {len(comms)}, modularity: {community.modularity(G, comms):.4f}")
# Step 4: Visualize with community coloring
pos = nx.spring_layout(G, seed=42)
fig, ax = plt.subplots(figsize=(10, 8))
for i, comm in enumerate(comms):
    nx.draw_networkx_nodes(G, pos, nodelist=list(comm), ax=ax,
                           node_color=[plt.cm.Set2(i)]*len(comm), node_size=400)
nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.3)
nx.draw_networkx_labels(G, pos, ax=ax, font_size=8)
plt.axis("off")
plt.tight_layout()
plt.savefig("social_network_analysis.png", dpi=300, bbox_inches="tight")
print("Saved social_network_analysis.png")
Goal: Build a PPI network from tabular data, analyze topology, and identify hub proteins.
import networkx as nx
import pandas as pd
# Step 1: Load interaction data from DataFrame
interactions = pd.DataFrame({
    "protein_a": ["TP53","TP53","BRCA1","BRCA1","MDM2","ATM","ATM","CHEK2","RB1","CDK2"],
    "protein_b": ["MDM2","BRCA1","ATM","CHEK2","RB1","CHEK2","BRCA2","CDC25A","CDK2","CCNA2"],
    "score": [0.99, 0.95, 0.92, 0.88, 0.91, 0.97, 0.85, 0.90, 0.87, 0.93]
})
G = nx.from_pandas_edgelist(interactions, "protein_a", "protein_b",
                            edge_attr="score")
print(f"PPI network: {G.number_of_nodes()} proteins, {G.number_of_edges()} interactions")
# Step 2: Network statistics
print(f"Connected: {nx.is_connected(G)}")
print(f"Diameter: {nx.diameter(G)}")
print(f"Avg path length: {nx.average_shortest_path_length(G):.2f}")
print(f"Transitivity: {nx.transitivity(G):.4f}")
# Step 3: Hub identification (multiple centrality measures)
degree_c = nx.degree_centrality(G)
between_c = nx.betweenness_centrality(G)
close_c = nx.closeness_centrality(G)
results = pd.DataFrame({
    "protein": list(G.nodes()),
    "degree_centrality": [degree_c[n] for n in G.nodes()],
    "betweenness": [between_c[n] for n in G.nodes()],
    "closeness": [close_c[n] for n in G.nodes()],
}).sort_values("betweenness", ascending=False)
print("\nHub proteins:")
print(results.head(5).to_string(index=False))
# Step 4: Export for downstream analysis
nx.write_graphml(G, "ppi_network.graphml")
results.to_csv("protein_centrality.csv", index=False)
print("Exported ppi_network.graphml and protein_centrality.csv")
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| weight | Paths/Centrality | None | Edge attribute name | Use weighted edges for path/centrality calculations |
| alpha | pagerank | 0.85 | 0.0-1.0 | Damping factor; lower = more uniform distribution |
| k | betweenness_centrality | None | int | Sample k nodes for approximation on large graphs |
| max_iter | eigenvector_centrality | 100 | int | Max iterations for convergence |
| seed | Generators/Layouts | None | int | Random seed for reproducibility |
| n / p / m | ER/BA generators | varies | int/float | Node count, edge probability, edges per new node |
| k / p | Watts-Strogatz | varies | int/float | Nearest neighbors, rewiring probability |
| nodetype | read_edgelist | None | int, float, str | Type conversion for node identifiers; with the default, IDs stay strings |
| edge_attr | from_pandas_edgelist | None | Column name(s) | Edge attribute columns to include from DataFrame |
| format | to_scipy_sparse_array | "csr" | "csr", "csc", "coo" | Sparse matrix format |
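As a rough illustration of the k and seed parameters from the table, the sketch below compares exact and sampled betweenness on a generated graph. The graph size and k=50 are arbitrary values picked for the example, not recommendations.

```python
import networkx as nx

G = nx.barabasi_albert_graph(n=300, m=3, seed=42)

exact = nx.betweenness_centrality(G)

# k=50 estimates the scores from 50 sampled source nodes; seed fixes the sample.
approx = nx.betweenness_centrality(G, k=50, seed=42)
approx_again = nx.betweenness_centrality(G, k=50, seed=42)

top_exact = max(exact, key=exact.get)
top_approx = max(approx, key=approx.get)
print(f"Top node (exact): {top_exact}, top node (k=50): {top_approx}")
print(f"Reproducible with same seed: {approx == approx_again}")
```

The sampled scores are estimates, so rankings can differ from the exact ones, especially among lower-ranked nodes.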
Always set random seeds for reproducible generators and layouts: seed=42 in both erdos_renyi_graph() and spring_layout().
Use approximate algorithms for large graphs: nx.betweenness_centrality(G, k=500) estimates scores from k sampled source nodes instead of all nodes; pass seed as well to make the sample reproducible.
Prefer from_pandas_edgelist over manual add_edge loops for bulk data loading -- handles attributes cleanly and is faster.
Copy subgraphs before modification: G.subgraph(nodes) returns a read-only view; call .copy() for a mutable independent graph.
Use GraphML or GML for persistent storage to preserve all node/edge attributes. Edge lists lose metadata unless explicitly handled.
Convert graph types explicitly: D.to_undirected() (DiGraph -> Graph), nx.Graph(M) (MultiGraph -> Graph, collapses multi-edges).
Use sparse matrices for large adjacency exports: to_scipy_sparse_array() is far more memory-efficient than to_numpy_array().
Anti-pattern -- Don't use nx.info(): Deprecated in 2.x and removed in NetworkX 3.0; use print(G), or G.number_of_nodes(), G.number_of_edges(), nx.density(G) directly.
Anti-pattern -- Don't assume node ordering: Algorithms may return results in different orders. Always index by node key, not position.
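Two of the practices above, copying subgraphs before modifying them and converting graph types explicitly, can be sketched on toy graphs as follows (node names and edge keys are illustrative only).

```python
import networkx as nx

G = nx.path_graph(5)  # 0-1-2-3-4

# A subgraph view is frozen: modifying it raises NetworkXError.
view = G.subgraph([0, 1, 2])
try:
    view.add_edge(0, 2)
except nx.NetworkXError as err:
    print(f"View is read-only: {err}")

# A copy is an independent graph; editing it leaves G untouched.
H = G.subgraph([0, 1, 2]).copy()
H.add_edge(0, 2)

# Explicit conversion: parallel edges of a MultiGraph collapse into one.
M = nx.MultiGraph()
M.add_edge("A", "B", key="binding")
M.add_edge("A", "B", key="regulation")
simple = nx.Graph(M)

print(f"H edges: {H.number_of_edges()}, G edges: {G.number_of_edges()}")
print(f"MultiGraph edges: {M.number_of_edges()}, Graph edges: {simple.number_of_edges()}")
```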
Extract the minimum spanning tree and compare to the original graph.
import random
import networkx as nx
# Create weighted graph
G = nx.erdos_renyi_graph(50, 0.15, seed=42)
rng = random.Random(42)  # one seeded RNG, reused so each edge gets its own weight
for u, v in G.edges():
    G[u][v]["weight"] = round(rng.random(), 2)
mst = nx.minimum_spanning_tree(G, weight="weight")
print(f"Original: {G.number_of_edges()} edges")
print(f"MST: {mst.number_of_edges()} edges")
total_weight = sum(d["weight"] for _, _, d in mst.edges(data=True))
print(f"MST total weight: {total_weight:.2f}")
Find cliques and compute graph coloring.
import networkx as nx
G = nx.karate_club_graph()
# Find all maximal cliques
cliques = list(nx.find_cliques(G))
print(f"Maximal cliques: {len(cliques)}")
largest_clique = max(cliques, key=len)
print(f"Largest clique size: {len(largest_clique)}, nodes: {largest_clique}")
# Greedy graph coloring
coloring = nx.greedy_color(G, strategy="largest_first")
n_colors = max(coloring.values()) + 1
print(f"Chromatic number (greedy upper bound): {n_colors}")
Build a directed acyclic graph and find execution order.
import networkx as nx
# Task dependency DAG
D = nx.DiGraph()
D.add_edges_from([
    ("download_data", "preprocess"),
    ("download_data", "validate"),
    ("preprocess", "analyze"),
    ("validate", "analyze"),
    ("analyze", "visualize"),
    ("analyze", "report"),
    ("visualize", "report"),
])
print(f"Is DAG: {nx.is_directed_acyclic_graph(D)}")
order = list(nx.topological_sort(D))
print(f"Execution order: {order}")
# Find all paths from start to end
paths = list(nx.all_simple_paths(D, "download_data", "report"))
print(f"Paths to report: {len(paths)}")
for p in paths:
    print(f"  {' -> '.join(p)}")
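A topological sort gives one valid linear order. If the goal is to see which tasks could run at the same time, one option is nx.topological_generations, sketched below on the same task graph.

```python
import networkx as nx

D = nx.DiGraph([
    ("download_data", "preprocess"),
    ("download_data", "validate"),
    ("preprocess", "analyze"),
    ("validate", "analyze"),
    ("analyze", "visualize"),
    ("analyze", "report"),
    ("visualize", "report"),
])

# Each generation contains tasks whose dependencies are all in earlier generations.
stages = [sorted(layer) for layer in nx.topological_generations(D)]
for i, stage in enumerate(stages):
    print(f"Stage {i}: {stage}")
```

Here preprocess and validate land in the same stage, while report comes last because it waits for both analyze and visualize.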
| Problem | Cause | Solution |
|---------|-------|----------|
| NetworkXError: Graph is not connected | Algorithm requires connected graph | Extract largest component: G.subgraph(max(nx.connected_components(G), key=len)).copy() |
| PowerIterationFailedConvergence | Eigenvector/PageRank did not converge | Increase max_iter (e.g., 1000) or check for disconnected components |
| Very slow centrality computation | O(n*m) complexity on large graphs | Use k parameter for sampling: betweenness_centrality(G, k=500) |
| nx.NetworkXNotImplemented | Algorithm not available for graph type | Convert graph type: G.to_undirected() or G.to_directed() |
| Memory error on large graphs | Dense adjacency matrix | Use to_scipy_sparse_array() instead of to_numpy_array() |
| Node IDs read as strings from file | read_edgelist defaults to str | Pass nodetype=int: nx.read_edgelist(f, nodetype=int) |
| Community detection returns frozen sets | Normal return type for communities | Convert: [list(c) for c in communities] |
| Self-loops in generated graphs | Configuration model allows self-loops | Remove: G.remove_edges_from(nx.selfloop_edges(G)) |
| Visualization too cluttered | Too many nodes/edges | Filter to subgraph, adjust alpha, increase figure size, or use interactive tools (Plotly, PyVis) |
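Two fixes from the table, removing self-loops and extracting the largest connected component, can be combined into a short clean-up step. The sketch below uses a tiny hand-made graph so the result is easy to check.

```python
import networkx as nx

# Toy graph: a triangle (1, 2, 3) with a self-loop on 3, plus a separate edge 4-5.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1), (4, 5), (3, 3)])

# Materialize the self-loop list before removal, which is also safe for multigraphs.
G.remove_edges_from(list(nx.selfloop_edges(G)))

# Keep only the largest connected component as an independent, mutable graph.
largest = max(nx.connected_components(G), key=len)
core = G.subgraph(largest).copy()

# Algorithms that require a connected graph now run without NetworkXError.
avg_len = nx.average_shortest_path_length(core)
print(f"Core nodes: {sorted(core.nodes())}, average path length: {avg_len}")
```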
Migrated from original entry (STUB: 436-line main file + 2,014 lines across 5 reference files, main/total = 17.8%).
Covers: Detailed algorithm parameters for traversal (DFS/BFS), cycles, cliques, graph coloring, isomorphism, matching/covering, tree algorithms (MST variants). Full generator catalog: classic graphs, lattice/grid, tree, bipartite, degree sequence, graph operations (union, compose, complement, products). Relocated inline: Core algorithms (centrality, paths, connectivity, community, flow) -> Core API Modules 3-5. Core generators (ER, BA, WS, SBM) -> Module 8. Omitted: A* heuristic customization, Bellman-Ford negative weights -- consult official docs.
Original file disposition:
algorithms.md (383 lines): Top algorithms relocated to Core API Modules 3-5 + Recipes. Remaining (traversal, cliques, coloring, isomorphism, matching, cycles, trees) -> this reference.
generators.md (378 lines): Core generators relocated to Module 8. Full catalog (classic, lattice, tree, bipartite, degree sequence, operators) -> this reference.
Covers: All I/O formats (adjacency list, GEXF, Pajek, LEDA, Cytoscape JSON, DOT/Graphviz, Matrix Market, CSV, database/SQL, compressed gzip). Format selection guide. Advanced visualization: Plotly interactive, PyVis HTML, Graphviz layouts, 3D networks, bipartite layout, community coloring, subgraph highlighting, multi-panel figures, edge labels, directed arrows.
Relocated inline: Core I/O (edge list, GraphML, JSON, pandas, NumPy/SciPy) -> Module 6. Basic matplotlib -> Module 7.
Omitted: write_gpickle/read_gpickle (deprecated), read_shp/write_shp (removed in NetworkX 3.0; use geopandas).
Original file disposition:
io.md (441 lines): Core formats relocated to Module 6. Remaining formats + format selection guide -> this reference.
visualization.md (529 lines): Basic matplotlib relocated to Module 7. Advanced techniques (Plotly, PyVis, 3D, bipartite, community coloring) -> this reference.
graph-basics.md (283 lines): Fully consolidated into main SKILL.md. Graph types -> Key Concepts. Node/edge operations, attributes, subgraphs -> Core API Modules 1-2. Diagnostics -> Common Workflows. Memory/float-point considerations -> Best Practices + Troubleshooting. Omitted: nx.info() (deprecated).