CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Yes, I've had this experience many times, and I know of many other people it's happened to repeatedly.
Maybe the proliferation of dating apps means that it happens somewhat less than it used to, because when you meet up with someone from a dating app, there's a bit more common knowledge of mutual interest than there is when you're flirting in real life?
I think it's accurate to say that most Anthropic employees are abhorrently reckless about risks from AI (though my guess is that this isn't true of most people in senior leadership or working on Alignment Science, and I think a bigger fraction of staff are thoughtful about these risks at Anthropic than at other frontier AI companies). This is mostly because they're tech people, who are generally pretty irresponsible. I agree that Anthropic sort of acts like "surely we'll figure something out before anything catastrophic happens", and this is pretty scary.
I don't think that "AI will eventually pose grave risks that we currently don't know how to avert, and it's not obvious we'll ever know how to avert them" immediately implies "it is repugnant to ship SOTA tech", and I wish you spelled out that argument more.
I agree that it would be good if Anthropic staff (including those who identify as concerned about AI x-risk) were more honest and serious than the prevailing Anthropic groupthink wants them to be.
I think the LTFF is a pretty reasonable target for donations for donors who aren't that informed but trust people in this space.
To be clear, I think we at Redwood (and people at spiritually similar places like the AI Futures Project) do think about this kind of question (though I'd quibble about the importance of some of the specific questions you mention here).
Justis has been very helpful as a copy-editor for a bunch of Redwood content over the last 18 months!
I think that if you wanted to contribute maximally to a cure for aging (and let's ignore the possibility that AI changes the situation), it would probably make sense for you to have a lot of general knowledge. But that's substantially because you're personally good at and very motivated by being generally knowledgeable, and you'd end up in a weird niche where little of your contribution comes from actually pushing any of the technical frontiers. Most of the credit for solving aging will probably go to people who narrowly specialized in a particular domain; much of the rest will go to people who applied their general knowledge to improving the overall strategy or the allocation of effort among people working on curing aging (while leaving most of the technical contributions to specialists)--this latter strategy crucially relies on management and coordination, and on not being fully in the weeds everywhere.
Thanks for this post. Some thoughts:
This kind of idea has been discussed under the names "surrogate goals" and "safe Pareto improvements"; see here.
The classic setting is a party (a place where you meet potential romantic partners whom you don't already know (or whom you know only from professional settings where flirting is inappropriate), and where conversations are freely starting and ending, such that when you start talking to someone, the conversation might go for either two minutes or four hours).
Examples of hints:
In all cases, saying those things is more flirty if it was unnecessary for them to say. E.g., if they mention they're single because it came up in conversation in a way they couldn't have contrived, that's less flirty than if they tell a story specifically to bring it up.
I think that online content on all this stuff is often pretty accurate.