
Used ray tracing cores on my RTX 5070 Ti for LLM routing — 218x speedup, runs entirely on 1 consumer GPU

Quick summary: I found a way to use the RT Cores (normally used for ray tracing in games) to handle expert routing in Mixture-of-Experts (MoE) models. Those cores sit completely idle during LLM inference, so why not put them to work?

What it does:

  • Takes over the routing decision in MoE models (deciding which experts process which tokens)
  • Projects tokens into 3D space
  • Uses the GPU's dedicated ray tracing hardware to find the right experts
  • O(log N) BVH traversal instead of an O(N) scan over all experts, hardware-accelerated (sketched below)
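
To make the mechanism concrete, here's a rough CPU-side sketch of the idea in Python. This is illustrative only, not the repo's CUDA/OptiX kernels: the random projection, expert coordinates, and top-4 choice are stand-ins, and the KD-tree plays the role of the BVH that the RT Cores traverse in hardware.

    # Geometric routing sketch: replace the dense scan over all experts
    # with a spatial nearest-neighbor query in a low-dimensional space.
    import numpy as np
    from scipy.spatial import cKDTree

    d_model, n_experts, top_k = 2048, 64, 4

    rng = np.random.default_rng(0)
    P = rng.normal(size=(d_model, 3))             # stand-in 3D projection (learned in practice)
    expert_xyz = rng.normal(size=(n_experts, 3))  # one 3D anchor point per expert

    tree = cKDTree(expert_xyz)                    # CPU analogue of the RT Cores' BVH

    def route(tokens):
        """tokens: (batch, d_model) -> (batch, top_k) expert indices."""
        xyz = tokens @ P                          # project each token into 3D
        _, idx = tree.query(xyz, k=top_k)         # tree query instead of an O(N) scan
        return idx

    experts = route(rng.normal(size=(1024, d_model)))
    print(experts.shape)  # (1024, 4)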

Numbers (OLMoE-1B-7B, RTX 5070 Ti 16GB):

  • 218x faster routing at batch 1024
  • 731x less VRAM for routing
  • Only +1.5% perplexity hit
  • 95.9% routing accuracy

Unexpected discovery: MoE experts don't actually specialize by topic. I tested three different models (OLMoE, Qwen-MoE, DeepSeek-MoE), and they all specialize by syntactic type (content words vs function words vs punctuation) instead. The "science expert" is a myth.

Code repo: https://github.com/JordiSilvestre/Spectral-AI

All papers are open access on Zenodo with full data and reproduction instructions: https://doi.org/10.5281/zenodo.19457288


10 comments

If I understand correctly, this achieves the speedup by just not calculating attention and replacing it with something completely different. This will, obviously, cause significant degradation. I see you didn't do any testing beyond HellaSwag; I recommend you test a benchmark that requires long-context understanding.

Also, why'd you have the AI that wrote this entire thing make up a bunch of your comparison numbers? GPT-4 is not public, so all your numbers regarding it are completely hallucinated.

Not to mention, I see you exclusively tested on models that are ancient. I'm assuming that's because those were all the ones ChatGPT knew about? Like c'mon man, Qwen1.5? Be serious.

1 pt

Interesting project!

1 pt

Thanks a lot!

1 pt

I also found that MoE experts don't actually specialize by topic.

How did you not know that already?

1 pt

You're right that it's somewhat known, or at least hinted at, in the literature (e.g. the Mixtral paper), but the "topic specialist" analogy (the math expert, the coding expert) is still incredibly pervasive in tutorials, blogs, and even some AI marketing.

My goal wasn't just to point out the myth, but to strictly quantify it across three very different architectures (OLMoE, Qwen, DeepSeek). I needed to test whether this syntactic clustering (content words vs function words vs punctuation) held universally despite completely different model sizes (7B to 16B) and routing strategies (top-4 vs top-8).

Understanding exactly how experts group syntactically (and confirming the U-shaped selectivity curve across layers) was a mandatory step before determining whether I could efficiently route them geometrically using the RT Cores.
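
If you want the flavor of that measurement, here's a hedged Python sketch (not the paper's exact pipeline; the class buckets and toy numbers are assumptions): bucket routed tokens into coarse syntactic classes, build a per-expert class distribution, and compare it against the corpus-wide baseline.

    # Sketch: how skewed is each expert toward a syntactic class,
    # relative to how often that class appears in the corpus overall?
    import numpy as np

    CLASSES = ["content", "function", "punct"]

    def expert_class_profile(expert_ids, token_classes, n_experts):
        """expert_ids: (n_tokens,) top-1 routed expert per token.
        token_classes: (n_tokens,) index into CLASSES.
        Returns (n_experts, 3): P(class | expert), row-normalized."""
        counts = np.zeros((n_experts, len(CLASSES)))
        np.add.at(counts, (expert_ids, token_classes), 1)
        rows = counts.sum(axis=1, keepdims=True)
        return counts / np.clip(rows, 1, None)

    def specialization_ratio(profile, baseline):
        """Per expert: max over classes of P(class | expert) / P(class)."""
        return (profile / baseline).max(axis=1)

    # Toy numbers: 4 experts, 8 tokens.
    profile = expert_class_profile(
        expert_ids=np.array([0, 0, 1, 2, 2, 3, 3, 3]),
        token_classes=np.array([0, 0, 1, 2, 2, 1, 1, 0]),
        n_experts=4,
    )
    baseline = np.array([3, 3, 2]) / 8  # corpus-wide class frequencies
    print(specialization_ratio(profile, baseline))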

1 pt
grumd·21d ago

they all specialize by syntactic type (content words vs function words vs punctuation). The "science expert" is a myth.

If this is true, then it makes sense why REAP models never worked for me.

1 pt

Exactly! The widespread assumption that we can control or edit MoEs by patching or routing to specific "concept experts" usually fails because you aren't isolating a topic—you're just accidentally amplifying or suppressing verbs, transition words, or punctuation.

In the second paper (Expert Specialization), I did a deep dive into this exact phenomenon. Across all 3 models, the absolute "best" topic-specialized expert activated at only 2.3x the uniform random baseline, which is negligible. Meanwhile, the syntactic clustering was incredibly sharp and consistent layer by layer.
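
For intuition on why 2.3x is negligible, here's the back-of-the-envelope math (the 64-expert, top-8 figures below are OLMoE's routing config, used as an assumption for illustration):

    # Under uniform random routing, any given expert fires on
    # top_k / n_experts of tokens; 2.3x that is still under 30%.
    n_experts, top_k = 64, 8
    uniform = top_k / n_experts   # 12.5% baseline activation rate
    best_topic = 2.3 * uniform    # ~29% for the single "best" topic expert
    print(f"{uniform:.1%} baseline -> {best_topic:.1%} peak topical activation")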

It completely shifts the paradigm on how we should approach Representation Engineering or expert-level patching!

1 pt
grumd·21d ago

Pretty sure you're writing both the post and the comments using AI; can you tell me why?

1 pt

Fair catch! 😅 English is not my native language (I'm an independent researcher from Spain).

Because the questions here are highly technical, I've been feeding my Spanish thoughts into an LLM to help me translate and structure my replies. I want to make sure I'm explaining the paper findings clearly without messing up the grammar.

I wrote all the CUDA/OptiX kernels and did the research myself, but I definitely rely on AI as my "English PR assistant" today. Apologies if it sounded a bit too robotic! The code in the repo is 100% real and human-made.

1 pt

Why does anyone do that? It’s laziness lol

1 pt