plugins/trading-operations/skills/operational-risk/SKILL.md
Guide identification, measurement, and management of operational risk in trading and brokerage operations. Use when designing trade error detection and correction procedures, investigating trade breaks and reconciliation failures, classifying loss events under Basel taxonomy, developing key risk indicators (KRIs) and dashboards, responding to system outages or data feed failures or order routing errors, conducting root cause analysis after a trade error or operational incident, planning business continuity and disaster recovery for trading desks, preparing for FINRA or SEC operational risk examinations, or assessing technology risk in OMS and market data systems. Also covers fat-finger errors, error account P&L, and corrective action tracking.
npx skillsauth add joellewis/finance_skills operational-riskInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Operational risk is the risk of loss resulting from inadequate or failed internal processes, people, and systems, or from external events. The Basel Committee's seven event-type categories map to trading operations as follows:
| Basel event type | Trading-operations examples | |---|---| | 1. Internal fraud | Unauthorized trading, intentional position mismarking, fictitious trade booking, front-running | | 2. External fraud | Account takeover, phishing for trade credentials, wire fraud in settlement instructions, counterparty manipulation | | 3. Employment practices and workplace safety | Inadequate operations training, key-person dependency, error-inducing workload | | 4. Clients, products, and business practices | Suitability failures, improper execution, best execution violations, failure to follow client instructions | | 5. Damage to physical assets | Data center or trading floor damage from natural disasters or civil disruption | | 6. Business disruption and system failures | OMS outages, market data feed failures, connectivity loss, exchange gateway and clearing system downtime | | 7. Execution, delivery, and process management | Trade errors, settlement fails, reconciliation breaks, failed corporate action processing, incorrect margin calculations (typically the largest loss category) |
Risk identification involves cataloging all operational risk exposures through process mapping, risk and control self-assessments (RCSAs), loss event analysis, scenario analysis, and audit findings. Risk assessment scores each risk on likelihood and impact dimensions, typically using a 5x5 heat map. Risk monitoring tracks KRIs, loss events, and control effectiveness. Risk mitigation applies controls (preventive and detective), process redesign, technology solutions, insurance, and business continuity planning.
A trade error occurs when a transaction is executed incorrectly due to human mistake, system malfunction, or miscommunication. Common trade error types include:
Error detection methods. Errors are detected through: real-time position monitoring (unexpected position changes trigger alerts), pre-trade validation rules (quantity limits, security restrictions, account eligibility checks), post-trade reconciliation (comparing expected vs. actual positions), client complaints, clearing firm or counterparty rejection notices, and P&L attribution (unexplained P&L often signals an error).
Error correction procedures. Once detected, errors must be corrected promptly:
A trade break occurs when two records of the same transaction do not match. Breaks arise at multiple points in the trade lifecycle:
Reconciliation process. Firms conduct three primary types of reconciliation:
Break resolution workflow. A typical break resolution process includes: (1) automated matching to clear breaks that are within tolerance thresholds (e.g., price differences under $0.01, quantity differences due to rounding); (2) assignment of unresolved breaks to operations analysts; (3) investigation to identify the root cause; (4) correction of the erroneous record in the appropriate system; (5) confirmation with the counterparty or custodian that the break is resolved; (6) documentation of the resolution and root cause.
Aging and escalation. Unresolved breaks are tracked by age. Industry standards and regulatory expectations require escalation based on aging thresholds:
| Age | Status | Action | |---|---|---| | T+0 to T+1 | Normal | Investigate and resolve in the ordinary course | | T+2 to T+3 | Attention | Escalate to senior operations staff; increase priority | | T+4 to T+5 | Warning | Escalate to operations management; engage counterparty directly | | T+5+ | Critical | Escalate to head of operations and compliance; assess financial exposure |
Tolerance thresholds. Firms establish tolerance levels below which breaks are auto-resolved. Common thresholds: price tolerance of +/- $0.01 per unit for exchange-traded securities, quantity tolerance of +/- 1 unit for rounding differences, and cash tolerance of +/- $1.00 for minor rounding. Tolerances must be reviewed periodically and should not be set so wide as to mask genuine errors.
Loss events are actual losses resulting from operational risk incidents. Effective loss event management requires:
Loss event identification. Sources include trade error P&L, settlement fail charges (buy-in costs, overdraft interest), regulatory fines and penalties, litigation settlements, system outage costs (missed trades, manual processing costs), and compensation payments to clients for service failures.
Loss event classification. Each loss event is classified by:
Loss event documentation. Each event record should include: date of occurrence, date of discovery, date of resolution, description of the event, root cause, Basel category, business line, gross loss amount, recoveries (insurance, counterparty reimbursement), net loss amount, corrective actions taken, and responsible manager.
Near-miss tracking. Events that could have resulted in a loss but did not (due to timely detection or favorable market movement) are tracked as near-misses. Near-misses are leading indicators of control weaknesses and are analyzed alongside actual losses. Example: a fat finger error that was caught by a pre-trade quantity limit before execution is a near-miss.
Loss event database. Firms maintain an internal loss event database (often part of a GRC — Governance, Risk, and Compliance — platform) that aggregates all loss events across the organization. The database enables trend analysis, root cause pattern identification, and reporting to senior management and the board.
Threshold reporting. Firms establish reporting thresholds:
| Threshold | Action | |---|---| | > $10,000 | Report to department head within 24 hours | | > $50,000 | Report to Chief Risk Officer within 24 hours | | > $100,000 | Report to senior management and Risk Committee | | > $500,000 | Board notification; assess regulatory reporting obligations |
These thresholds are illustrative; each firm calibrates to its size, complexity, and risk appetite.
Regulatory notification. Certain loss events trigger regulatory reporting obligations. FINRA Rule 4530 requires member firms to report specified events, including significant operational incidents. SEC Rule 17a-11 requires broker-dealers to notify the SEC of certain financial and operational conditions. Firms must maintain a matrix mapping loss event types and thresholds to applicable regulatory notification requirements.
KRIs are metrics that provide early warning of increasing operational risk exposure. They are distinguished from key performance indicators (KPIs) in that KRIs are specifically designed to signal risk rather than measure performance, though some metrics serve both purposes.
Leading vs. lagging indicators. Leading indicators predict future risk events (e.g., rising system latency may predict an outage). Lagging indicators measure events that have already occurred (e.g., number of trade errors last month). An effective KRI program includes both types.
Common trading operations KRIs:
| KRI | Definition | Leading/Lagging | |---|---|---| | NIGO rate | Not-In-Good-Order rate: percentage of trade instructions received with missing or incorrect information | Leading | | Trade break rate | Number of unmatched trades as a percentage of total trades | Lagging | | Settlement fail rate | Number of failed settlements as a percentage of total settlements | Lagging | | Trade error rate | Number of trade errors per 1,000 trades executed | Lagging | | Error account balance | Aggregate dollar value of positions in error accounts | Lagging | | STP rate | Straight-Through Processing rate: percentage of trades processed without manual intervention | Leading | | System availability | Uptime percentage of critical trading and operations systems | Leading | | Margin call volume | Number and dollar value of margin calls issued or received | Leading | | Aged break count | Number of trade breaks older than the escalation threshold | Leading | | Cancel/correct ratio | Number of trade cancellations and corrections as a percentage of total trades | Lagging | | Reconciliation completion rate | Percentage of daily reconciliations completed by the target deadline | Leading | | Open incident count | Number of unresolved operational incidents | Leading |
KRI thresholds. Each KRI is assigned threshold levels using a traffic-light model:
Example threshold calibration for trade break rate:
| Level | Threshold | Action | |---|---|---| | Green | < 2% of daily trade volume | Routine monitoring | | Amber | 2% - 5% of daily trade volume | Investigate root cause; increase reconciliation frequency | | Red | > 5% of daily trade volume | Escalate to Head of Operations; halt new activity if warranted |
KRI trending and reporting. KRIs are tracked over time to identify trends. A KRI that remains in the green zone but is trending upward toward amber is more informative than a snapshot reading. Monthly KRI reports to management should include current values, threshold status, trend direction, and commentary on any amber or red indicators.
Operational incidents in trading operations range from minor system glitches to major outages that affect market participation. A structured incident management process ensures consistent response and resolution.
Incident classification (severity levels):
| Severity | Definition | Examples | Response Time | |---|---|---|---| | SEV-1 (Critical) | Complete loss of trading capability or significant financial exposure | Order management system down; inability to route orders to any exchange; clearing system failure preventing settlement | Immediate; all-hands response | | SEV-2 (Major) | Significant degradation of trading capability or material financial risk | Market data feed failure for a major exchange; inability to process a specific order type; partial connectivity loss | Within 15 minutes | | SEV-3 (Moderate) | Limited impact on trading operations; workaround available | Slow system performance; failure of a non-critical reporting function; single counterparty connectivity issue | Within 1 hour | | SEV-4 (Minor) | Minimal operational impact; no financial exposure | Cosmetic UI issues; non-urgent report delays; minor data quality issues with no trade impact | Within 4 hours |
Incident response procedures. A standard incident lifecycle includes:
Escalation matrix. The escalation path is defined by severity level:
Root cause analysis techniques. Two widely used methods:
Corrective action tracking. Every root cause analysis produces corrective actions. Each action is assigned an owner, a target completion date, and a status (open, in progress, completed, verified). A corrective action register is maintained and reviewed at regular operational risk meetings. Corrective actions are not considered closed until they have been independently verified as effective.
Trading operations must maintain the ability to continue critical functions during disruptive events. Regulatory requirements (including FINRA Rule 4370) mandate business continuity planning for broker-dealers.
FINRA Rule 4370 (Business Continuity Plans and Emergency Contact Information). Every FINRA member must create and maintain a written business continuity plan (BCP) that addresses, at a minimum: data backup and recovery, all mission-critical systems, financial and operational assessments, alternate communications with customers and regulators, alternate physical location, critical business constituent impact, regulatory reporting, and communications with regulators. The plan must be updated in the event of any material change to the firm's operations, structure, business, or location.
Recovery Time Objective (RTO). The maximum acceptable duration of a system outage before the business impact becomes unacceptable. For trading operations, RTOs are typically measured in minutes to hours:
| System | Typical RTO | |---|---| | Order management system | < 30 minutes | | Market data feeds | < 15 minutes | | Exchange connectivity | < 15 minutes | | Risk management system | < 1 hour | | Settlement/clearing interface | < 2 hours | | Client reporting systems | < 4 hours |
Recovery Point Objective (RPO). The maximum acceptable amount of data loss measured in time. An RPO of 5 minutes means the firm can tolerate losing at most 5 minutes of transaction data. For trading systems, RPOs are typically near-zero (synchronous replication) for order and execution data, and minutes for less critical data.
Failover procedures. Critical systems should have automated or semi-automated failover to secondary environments. This includes: active-passive database replication with automated promotion of the standby, redundant network paths to exchanges and clearing firms, geographically separated data centers, and pre-configured disaster recovery trading environments.
Remote trading capabilities. Firms must ensure that traders and operations staff can operate from alternate locations. This includes: VPN access to trading systems, pre-provisioned remote trading workstations, tested voice communication (trading turrets, recorded phone lines) from remote locations, and documented procedures for activating remote trading.
Communication plans. During a disruption, the firm must communicate with: clients (regarding order status, account access, and alternate contact methods), regulators (FINRA, SEC, exchanges), counterparties and clearing firms, employees, and critical vendors. Contact trees and communication templates should be pre-established and tested.
Testing requirements. FINRA Rule 4370 requires that BCPs be reviewed and tested at least annually. Industry best practice includes: tabletop exercises (walkthrough of scenarios), functional testing of backup systems and failover, full-scale simulation exercises, and third-party testing with exchanges and clearing firms. Test results should be documented and deficiencies addressed through corrective actions.
Technology risk is a subset of operational risk that is particularly acute in trading operations due to the dependence on automated systems for order routing, execution, risk management, and settlement processing.
System reliability. Trading systems must meet high availability standards. Common targets are 99.95% uptime (approximately 4.4 hours of allowable downtime per year) for mission-critical systems. Reliability is achieved through redundant architecture, automated monitoring, capacity planning, and regular performance testing.
Change management. Software and configuration changes to trading systems are a leading source of operational incidents. A disciplined change management process includes: change request documentation, impact assessment, testing in non-production environments, scheduled deployment windows (avoiding market hours for high-risk changes), rollback procedures, and post-deployment verification. Emergency changes during market hours require expedited approval with heightened risk awareness.
Vendor risk management. Trading operations depend on numerous third-party vendors for market data, order routing, clearing, settlement, and technology infrastructure. Vendor risk management includes: due diligence before onboarding, service level agreements (SLAs) with measurable performance standards, ongoing monitoring of vendor performance and financial health, contingency plans for vendor failure, and concentration risk assessment (avoiding excessive dependence on a single vendor for critical functions).
Cybersecurity in trading systems. Trading systems are high-value targets for cyberattack. Key cybersecurity controls include: network segmentation to isolate trading systems, multi-factor authentication for system access, encryption of data in transit and at rest, intrusion detection and prevention systems, regular penetration testing, and incident response plans specific to cyber events.
Market data system failures. Loss of market data (prices, quotes, reference data) can prevent accurate order pricing, risk calculation, and compliance checking. Firms should maintain: redundant market data feeds from multiple vendors, fallback pricing mechanisms (last known price, manual price entry with controls), and alerts for stale or missing data. Market data failures that affect order routing or execution quality should be classified and managed as operational incidents.
Order routing system failures. Inability to route orders to exchanges or market centers is a SEV-1 incident for a trading operation. Controls include: redundant FIX connections to each execution venue, alternative order routing paths, manual order entry capabilities at exchange terminals as a last resort, and pre-established procedures for notifying clients of execution delays.
Three worked examples are in references/examples.md — load for an end-to-end scenario: (1) building an operational risk framework for a broker-dealer's trading desks, (2) designing a standardized trade error handling and correction process, (3) implementing a KRI dashboard for trading operations management.
testing
Model, forecast, and interpret volatility using time-series models and options-implied measures. Use when the user asks about EWMA, GARCH models, implied volatility, volatility surfaces, volatility term structure, or the VIX. Also trigger when users mention 'volatility smile', 'volatility skew', 'realized vs implied vol', 'volatility risk premium', 'vol clustering', 'mean-reverting volatility', 'options pricing inputs', 'RiskMetrics', 'decay factor', or ask how to forecast future volatility for risk management.
testing
Execute a complete tax-loss harvesting workflow from candidate identification through post-harvest monitoring. Use when the user asks about finding TLH candidates, gain/loss budgeting, replacement security selection, wash-sale compliance, or harvest execution planning. Also trigger when users mention 'unrealized losses in my portfolio', 'swap ETFs for tax purposes', 'harvest losses before year-end', 'substantially identical security', 'wash-sale window', 'NIIT offset', 'loss carryforward', or ask how much tax they can save by harvesting.
testing
Maximizes after-tax returns through strategic asset location, gain/loss management, and withdrawal sequencing. Use when the user asks about asset location, Roth conversions, tax-efficient withdrawals, tax lot selection, or charitable giving with appreciated securities. Also trigger when users mention 'which account should I hold bonds in', 'tax drag', 'Roth vs Traditional', 'RMD planning', 'bracket stuffing', 'HIFO vs FIFO', or ask how to minimize taxes on investments. For tax-loss harvesting execution and wash-sale mechanics, see the tax-loss-harvesting skill.
development
Plan and track savings for specific financial goals including retirement, education, and home purchase. Use when the user asks about required savings rates, 529 plans, retirement accumulation targets, down payment planning, or goal prioritization. Also trigger when users mention 'how much do I need to save each month', 'am I on track for retirement', 'college savings', 'safe withdrawal rate', '4% rule', 'FIRE savings rate', 'catch-up contributions', 'employer match', or ask how to balance competing savings goals.