AI language models are slow on small computers, and for long inputs the bottleneck is usually not the model weights but attention: the mechanism that lets every word look at every other word in the text. Double the text length and attention does four times the work, because its cost grows with the square of the length.
ruvllm_sparse_attention fixes this by teaching the model to be selective. Instead of letting every word look at every other word, each word looks only at two things (sketched in the code after this list):
- The words closest to it (recent context)
- A few anchor words at the start (global signals)
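To make the pattern concrete, here is a minimal Rust sketch of the attention mask this describes: each word attends to the few words just before it plus a handful of anchor words at the start. The function name `sparse_mask` and the `window` / `num_sinks` parameters are illustrative placeholders, not the crate's actual API.

```rust
/// Build a boolean mask: mask[q][k] is true when word q may look at word k.
/// Illustrative sketch only; names and shapes are assumptions, not the
/// ruvllm_sparse_attention API.
fn sparse_mask(seq_len: usize, window: usize, num_sinks: usize) -> Vec<Vec<bool>> {
    let mut mask = vec![vec![false; seq_len]; seq_len];
    for q in 0..seq_len {
        for k in 0..=q {
            // Recent context: the `window` words immediately before this one.
            let local = q - k < window;
            // Global anchors: the first `num_sinks` words of the text.
            let sink = k < num_sinks;
            mask[q][k] = local || sink;
        }
    }
    mask
}

fn main() {
    // Each row is one word; '#' marks the words it is allowed to attend to.
    for row in sparse_mask(8, 3, 2) {
        let line: String = row.iter().map(|&b| if b { '#' } else { '.' }).collect();
        println!("{line}");
    }
}
```

With this pattern each word attends to at most `window + num_sinks` others, so the work grows linearly with the text length instead of quadratically.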