Laguna XS.2 and M.1

(poolside.ai)

58 points | by tosh 2 hours ago

11 comments

  • simjnd 29 minutes ago
    Probably a testament to how good Qwen3.6 is considering Qwen3.6-35B-A3B is not only ahead of their similar weight class XS.2 but also their M.1 (close to 10x bigger at 225B-A23B).

    Interestingly, Gemma 4 26B-A4B and Qwen3.6 27B (dense) have been left out of the comparison.

    The smaller models are becoming very good, and quantization techniques like importance weighting and TurboQuant on model weights let you run aggressively quantized versions (IQ2, TQ3_4S) on consumer hardware with extremely acceptable perplexity and quality loss.

    Very exciting times for local LLMs.

  • rohitpaulk 2 hours ago
    Been testing these via their "pool" agent. It's fast, and the agent adheres to the ACP spec pretty well (better than codex, opencode etc.) so it's a good experience in Zed.
  • orliesaurus 16 minutes ago
    The colors used in the charts are borderline criminal
  • throwaw12 2 hours ago
    Has anyone tried these models?

    I like their honesty in benchmarks, looks like Qwen3.6 35B is outperforming their Laguna M.1 225B model

  • speedgoose 1 hour ago
    Please update the charts. Consider using textures or filling patterns.

    I usually score pretty well in colour perception tests but distinguishing between those two purples made me doubt myself.

    • matthewfcarlson 1 hour ago
      My phone is in grayscale to make it less interesting (I still watch way too many videos in grayscale but it helps) so I’m right with you
  • jaen 1 hour ago
    For similarly sized models, not looking very good on the slightly-less-benchmaxxed Terminal-Bench 2.0:

      Laguna XS.2  33B-A3B params: 30.6
      Qwen 3.6     35B-A3B       : 51.5
      Devstral 2   123B          : 31.2
    
    Quite a huge lead for Qwen... well, at least it's catching up to other smaller Western labs.
    • megavon 1 hour ago
      Need to look at SWEBench-Pro, it's super competitive. Suspect they'll catch up given the longer-tail on TB scores.
      • jaen 1 hour ago
        Just by the (lack of) inter-model variance, I don't think SWEBench-Pro does a very good job of representing model capability. Terminal-Bench seems more challenging and separates the wheat from the chaff.

        Also, *ops work, which in my experience can actually be more complicated than SWE, is obviously underrepresented there.

  • franksiem 1 hour ago
    Felt like they would never come out of stealth mode but very nice to see it materialized into something competitive.
    • throwaw12 1 hour ago
      Not sure if this is competitive, look at the numbers for Qwen3.6
    • refulgentis 1 hour ago
      What makes them distinctive?
  • kingjimmy 1 hour ago
    the color-codes make those benchmark charts impossible to understand. very pretty though.
    • data-ottawa 1 hour ago
      For what it's worth, the bars correspond in order with the legend. Plus there’s hover text.
  • esafak 1 hour ago
    They're not winning any popular benchmark. Is there some niche where it excels?
    • vmarkovtsev2 39 minutes ago
      Well there are benchmarks, and there is real experience, right? They are not the same.
  • gslepak 1 hour ago
    Very cool to see more small open models being worked on!

    One nit: I've seen on this homepage, and many others, this notion that the people behind the models are "working towards AGI".

    I get that this is marketing speak, but transformers are not AGI, and they will never be AGI, so it'd be great if people stopped saying that as it sort of wears out the meaning of "working towards AGI".

    • liuliu 53 minutes ago
      > but transformers are not AGI, and they will never be AGI

      Like the claim "transformers are AGI", this needs proof, otherwise it should be prefixed with "I think". And honestly, positive proof is easier than negative proof (you just need to make one transformer model that is an AGI, whereas the "never" claim requires you to enumerate all possibilities).

      • gslepak 49 minutes ago
        That's like saying we should wait for positive proof of AGI from combustion engines. That'll never happen, no matter how much you tweak the engine. It's just not possible.

        The negative proof is there in the definition itself. Transformers are not AGI, they're frozen human intelligence of the autocomplete variety. That can never be AGI and anyone who says otherwise doesn't understand transformers or AGI.

    • altruios 53 minutes ago
      What does AGI mean to you?

      Transformers have approximate knowledge of many things. Is this not 'general'? Where is the goalpost here?

      • gslepak 46 minutes ago
        > Transformers have approximate knowledge of many things. Is this not 'general'?

        Of course not. That's like saying the Encyclopedia Britannica is AGI.

        > What does AGI mean to you?

        I would define AGI as human-like machine intelligence (or superior).

        This is difficult for some people to understand because they don't understand what "human-like" means in the first place. Neuroscientists would be able to set some of these wayward computer scientists straight on this question.

        • altruios 26 minutes ago
          > human-like

          But is that a hard requirement? Can a machine have Rat-like intelligence? Is all intelligence human-like (human-centric-mind-blindness-much?)?

          > Of course not. That's like saying the Encyclopedia Britannica is AGI.

          Well, I'd classify that as GK, general knowledge. Not artificial or intelligent.

          Let's consider a definition of intelligence as the act of 'manipulating data', have you a better general definition of intelligence?

          • gslepak 21 minutes ago
            > But is that a hard requirement?

            Yes.

            > Can a machine have Rat-like intelligence?

            Yes, and that would be closer to AGI than today's LLMs, because the fundamental principles and architecture are there.

        • chabes 23 minutes ago
          Agreed. The widespread anthropomorphizing is getting so tiring.

          I blame it on the big companies in the space, but seeing intelligent folks regularly attributing intelligence to a complex autocomplete system is disappointing.