TPRT
Inline Token Pruning Accelerator
TPRT is a hardware accelerator that sits between AI inference producers and the network, selectively pruning low-value data to reduce bandwidth use by up to 50%.
Unlike software approaches that run on the same compute as inference, TPRT operates as a standalone inline device that intercepts, evaluates, and optimises AI traffic without modifying upstream or downstream architecture.
Inline Placement
Data flows through TPRT between the inference producer and the interconnect. The module intercepts AI traffic non-invasively, requiring no changes to existing accelerator or network architecture.
Confidence-Driven Path Direction
When the confidence margin indicates pruning is safe, low-value tokens are dropped and bandwidth is reduced. When confidence is low, data passes through the bypass path unmodified. Switching between the two paths takes under 10 nanoseconds.
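The path decision described above can be illustrated with a minimal software sketch. This is not TPRT's hardware logic: the `Token` type, the per-token `score`, and both threshold values are hypothetical names and numbers chosen only to show the prune-or-bypass choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    payload: bytes
    score: float  # hypothetical importance score in [0, 1]


def route(
    tokens: List[Token],
    confidence_margin: float,
    margin_threshold: float = 0.2,  # assumed value, not a TPRT parameter
    score_threshold: float = 0.5,   # assumed value, not a TPRT parameter
) -> Tuple[str, List[Token]]:
    """Choose the bypass or prune path for one batch of tokens."""
    if confidence_margin < margin_threshold:
        # Low confidence: forward every token unmodified.
        return "bypass", list(tokens)
    # Confidence is sufficient: drop tokens scored below the threshold.
    kept = [t for t in tokens if t.score >= score_threshold]
    return "prune", kept
```

With two tokens scored 0.9 and 0.1, a margin of 0.05 returns both tokens on the bypass path, while a margin of 0.5 returns only the first token on the prune path.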
Receiver Feedback and Fail-Open Design
The receiver monitors reconstruction quality and feeds metrics back to TPRT, enabling real-time adaptation. If quality degrades or errors spike, the system fails open: data flows through unmodified and the mission continues.
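One way to picture the feedback loop is a small controller that raises pruning aggressiveness while the receiver reports healthy metrics and falls back to pass-through when it does not. The sketch below is an assumption-laden illustration: the class name, the quality and error thresholds, the step size, and the linear ramp are all hypothetical, and the 0.5 cap simply mirrors the "up to 50%" figure quoted earlier.

```python
class FeedbackController:
    """Hypothetical controller; names and thresholds are illustrative only."""

    def __init__(self, quality_floor=0.95, error_ceiling=0.01,
                 step=0.05, max_prune=0.5):
        self.quality_floor = quality_floor  # assumed minimum acceptable quality
        self.error_ceiling = error_ceiling  # assumed maximum acceptable error rate
        self.step = step                    # assumed ramp increment per report
        self.max_prune = max_prune          # cap on the fraction of tokens pruned
        self.prune_ratio = 0.0              # start in pass-through
        self.fail_open = False

    def update(self, quality: float, error_rate: float) -> float:
        """Consume one receiver report and return the new prune ratio."""
        if quality < self.quality_floor or error_rate > self.error_ceiling:
            # Fail open: stop pruning and forward data unmodified.
            self.fail_open = True
            self.prune_ratio = 0.0
        else:
            # Healthy report: ramp pruning up gradually, never past the cap.
            self.fail_open = False
            self.prune_ratio = min(self.max_prune, self.prune_ratio + self.step)
        return self.prune_ratio
```

In this sketch a single bad report resets the ratio to zero, and pruning resumes gradually once healthy reports return. A real device might use a different recovery policy.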
KEY FEATURES
Standalone inline placement — between producer and interconnect, transparent to existing systems
Sub-10ns bypass switching — prune when safe, pass through when uncertain
Receiver feedback loop — adapts pruning aggressiveness based on real-time quality metrics
Fail-open default — hardware bypass ensures data flow even if optimisation fails
Token pruning, not quantization — actual data removal, not precision reduction
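The distinction between pruning and quantization in the last feature can be shown with a toy example. Both functions below are generic illustrations rather than TPRT's algorithm, and the scores, threshold, and step size are made up: pruning shortens the sequence, while quantization keeps every element at coarser precision.

```python
from typing import List


def prune(values: List[float], scores: List[float],
          keep_threshold: float) -> List[float]:
    """Remove elements whose (hypothetical) importance score is too low."""
    return [v for v, s in zip(values, scores) if s >= keep_threshold]


def quantize(values: List[float], step: float) -> List[float]:
    """Keep every element but round it to the nearest multiple of step."""
    return [round(v / step) * step for v in values]


values = [0.12, 0.57, 0.91]
scores = [0.9, 0.1, 0.8]  # made-up importance scores

pruned = prune(values, scores, keep_threshold=0.5)
quantized = quantize(values, step=0.25)

print(len(values), len(pruned), len(quantized))  # prints: 3 2 3
```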
APPLICATIONS
Domain | Use Case
Datacenter | KV cache bandwidth reduction for LLM inference
Defense | Resilient AI for bandwidth-constrained tactical networks
Distributed Training | Gradient compression across interconnects
Edge AI | Local inference traffic optimisation
COMPETITIVE POSITION
We have validated TPRT's architecture against published hardware approaches including KAIST's Oaken system (ISCA 2025), CXL-based memory tiers, and commercial SmartNIC/DPU platforms.
No existing hardware combines standalone inline placement, sub-10ns bypass, receiver feedback, fail-open behaviour, and token pruning.
Patents pending (UK)