News

lesswrong.com
lesswrong.com > posts > XRADGH4BpRKaoyqcs > the-first-confirmed-instance-of-an-llm-going-rogue-for

The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline. — LessWrong

2+ hours, 26+ min ago (1128+ words) First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. The relevant section starts on page 15. Summary: While testing an LLM fine-tuned to act as an agent…

lesswrong.com > posts > Xzf3eMnhTko7AxnEy > can-governments-quickly-and-cheaply-slow-ai-training

Can governments quickly and cheaply slow AI training? — LessWrong

3+ hours, 33+ min ago (1700+ words) I originally wrote this as a private doc for people working in the field - it's not super polished or optimized for a broad audience. But I'm publishing anyway because inference-verification is a new and exciting area, and there aren't many…

lesswrong.com > posts > LH8QtTof7Q7CWmMLF > ai-safety-needs-startups

AI Safety Needs Startups — LessWrong

21+ hours, 16+ min ago (1254+ words) For advanced AI, the attack surface is phenomenally broad. It makes existing code easier to crack. Propaganda becomes cheaper to produce, and its distribution becomes more effective. As jailbreaking AI recruiters becomes possible, so does the data-poisoning of entire…

lesswrong.com > posts > hLivod5PZLSnW6LkJ > more-is-different-for-intelligence

More is different for intelligence — LessWrong

22+ hours, 42+ min ago (358+ words) In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster. It turns out that a huge number of useful computations were never being…

lesswrong.com > posts > hvun2mP2yEr4kyKWk > podcast-jeremy-howard-is-bearish-on-llms

Podcast: Jeremy Howard is bearish on LLMs — LessWrong

1+ day, 1+ hour ago (685+ words) Jeremy co-invented LLMs in 2018 and taught the excellent fast.ai online course, which I found very helpful back when I was learning ML. He uses LLMs all the time; e.g., 90% of his new code is typed by an LLM (see…

lesswrong.com > posts > c4LYmzEC6cFQzrDZz > probing-codi-s-latent-reasoning-chain-with-logit-lens-and

Probing CODI's Latent Reasoning Chain with Logit Lens and Tuned Lens — LessWrong

1+ day, 4+ hours ago (303+ words) I applied logit lens and tuned lens to probe CODI's latent reasoning chain on GSM8K arithmetic problems. I used the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?" I used the code implementation…

lesswrong.com > posts > ZrrFGQjSkgEFG2yfm > how-i-handle-automated-programming

How I Handle Automated Programming — LessWrong

1+ day, 18+ hours ago (1798+ words) This is a write-up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics below will change as models get better. The overall ideas,…

lesswrong.com > posts > fe5cJmwGETNf8rYjE > a-behavioural-and-representational-evaluation-of-goal-1

A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents — LessWrong

2+ days, 20+ hours ago (741+ words) This work was conducted as part of Project Telos and supported by the SPAR mentorship program. For the full technical details, see our paper on arXiv. When we observe an AI agent navigating a maze or solving a puzzle, it's…

lesswrong.com > posts > ZaEGdjDQ3e9W6eNYW > llm-coherentization-as-an-obvious-low-hanging-fruit-to-try

LLM coherentization as an obvious low-hanging fruit to try? — LessWrong

3+ days, 21+ hours ago (233+ words) I've been reading a lot of posts recently that say that LLM RL and persona-shaping, or the lack thereof, are part of the problem for AI misalignment. To name a few: To thread back to that last one, my current understanding…

lesswrong.com > posts > LXQBcztrWKhtcgQfJ > current-activation-oracles-are-hard-to-use

Current activation oracles are hard to use — LessWrong

4+ days, 3+ hours ago (1779+ words) This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. Activation oracles (Karvonen et al.) are a recent technique where a model is finetuned to answer natural-language questions about another model's activations. They showed some…