
When it rains, it pours.

We have the SuperHOT LoRA and the associated paper, the Salesforce model, MPT, and this. Any other long-context news I am missing?




Yup, someone supposedly improved SuperHOT scaling already.

Here you go, https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkawar...
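If you want to see how small the change actually is, here's a minimal sketch of the trick as I read that post (the function name and defaults are mine, assuming a LLaMA-style RoPE with base 10000):

    import torch

    def rope_frequencies(dim: int, alpha: float = 1.0, base: float = 10000.0) -> torch.Tensor:
        # NTK-aware scaling: instead of interpolating positions (the
        # SuperHOT approach), raise the RoPE base so the high-frequency
        # dimensions are barely touched while the low-frequency ones get
        # stretched. alpha = 1.0 recovers vanilla RoPE.
        scaled_base = base * alpha ** (dim / (dim - 2))
        return 1.0 / (scaled_base ** (torch.arange(0, dim, 2).float() / dim))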


Turns out that dynamic NTK-aware scaling further improves on that, with perplexity equal to or better than the unscaled model at all context lengths: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamic...
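The dynamic variant is an equally small change: leave alpha at 1 inside the trained window and grow it with the sequence length past that, so short contexts stay exactly vanilla. A sketch of how I read the post (the names and the scaling_factor knob are mine):

    def dynamic_ntk_alpha(seq_len: int, trained_len: int = 2048,
                          scaling_factor: float = 1.0) -> float:
        # Within the trained context window, don't scale at all, so
        # short-context perplexity is untouched.
        if seq_len <= trained_len:
            return 1.0
        # Past it, grow alpha with the current sequence length.
        return scaling_factor * seq_len / trained_len - (scaling_factor - 1)

    # Recompute the RoPE frequencies whenever the context outgrows the
    # trained length, e.g. with the rope_frequencies sketch above:
    #   inv_freq = rope_frequencies(dim=128, alpha=dynamic_ntk_alpha(n_tokens))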


It just keeps getting better!

This is so exciting because context length has been a problem with these models for so long, and it's awesome to see open source finally cracking it.

I need another GPU.

edit:

I'm actually a little bit skeptical of this. Yes, it scales dynamically, which is great when you have a model that's not fine-tuned, but I don't think it's going to work out well when you try to fine-tune a model on a target that moves like that. I'd rather have one that stays static, so that perplexity keeps improving up to the max context, rather than doing what those graphs show, where it gets worse over time.

That said I don't really know what I'm talking about so maybe it'll be better regardless.


Yeah, there's a research paper idea sitting there for someone who wants to run some more ablation tests and see if there are any unwanted side effects. Though if it gets the claimed performance on a non-fine-tuned model, you may not need to fine-tune it in the first place.

There's a symbiotic relationship between the open source community and academic research here: the broader community can explore a ton of different improvements, at a much more rapid pace than either academia (slower because it has additional objectives) or closed-source labs (where the lack of sharing means a lot of low-hanging fruit gets overlooked). Academic research can build the bigger, more experimental projects that create new capabilities. It can also do the careful testing that the broader dev community doesn't have the time and resources for, giving us a better idea of which parts actually work best and why. It can be very valuable to have a paper that tells us something everybody already knows, because verifying the common assumptions empirically gives us numbers showing how well they actually hold.

I expect to see a lot more discoveries that bounce back and forth between the audiences, because both groups benefit in different ways.


Ah yeah, I confused RoPE with SuperHOT.



