Neel Nanda

Comments


I continue to feel like we're talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules even when breaking them seems locally beneficial for really good reasons, because on average, a decision theory that's willing to do harmful things for complex reasons performs badly.

The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions, for things like having someone's consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.

Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of that bad outcome (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.

When analysing the reasons I broadly buy the deontological framework for "don't commit murder", I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very far outside the context in which such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes situation where far and away the thing I care about most is taking the actions that will actually, in practice, lead to a lower probability of extinction.

Regarding your examples, I'm completely ethically comfortable with someone creating a third political party in a country where the population consists of two groups who each strongly want to commit genocide against the other. I think there are many ways that such a third party could reduce the probability of genocide, even if it is ultimately built on a political base that wants negative outcomes.

Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I'm strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what's disanalogous?

COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.

You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.

Again, I'm purely defending the abstract point that "plans that could increase the risk of human extinction, even by building the doomsday machine yourself, are not automatically ethically forbidden". You're welcome to critique Anthropic's actual actions as much as you like. But you seem to be making a much more general claim.


Adding some clarifications re my personal takes on how I think about this from an AGI safety perspective: I see these ideas as Been's brainchild; I largely just helped out with the wording and framing. I do not currently plan to work on agentic interpretability myself, but I still think the ideas are interesting and plausibly useful, and I'm glad the perspective is written up! I still see one of my main goals as working on robustly interpreting potentially deceptive AIs, and my guess is that this is not the comparative strength of agentic interpretability.

Why care about it? From a scientific perspective, I'm a big fan of baselines and doing the simple things first. "Prompt the model and see what happens" or "ask the model what it was doing" are the obvious things you should do first when trying to understand a behaviour. In internal experiments, we often find that we can just solve a problem with careful and purposeful prompting, with no need for anything fancy like SAEs or transcoders. But it seems kinda sloppy to "just do the obvious thing": I'm sure there's a bunch of nuance in doing this well, and in training models so that this is easy to do. I would be excited for there to be a rigorous science of when and how well these kinds of simple black box approaches actually work. This is only part of what agentic interpretability is about (there's also white box ideas, more complex multi-turn stuff, an emphasis on building mental models of each other, etc.), but it's a direction I find particularly exciting – if nothing else, we need the answer to know where other interpretability methods can add value.
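For concreteness, here is a minimal sketch of the kind of "ask the model what it was doing" baseline I have in mind (the `query_model` helper and the prompts are hypothetical placeholders, not any particular API):

```python
# Minimal sketch of a black box "self-explanation" baseline.
# `query_model` is a hypothetical placeholder: wire it up to whatever
# chat model you are studying. Nothing here refers to a real library API.

def query_model(messages: list[dict]) -> str:
    """Placeholder: send a chat transcript to the model, return its reply."""
    raise NotImplementedError("Connect this to your model of choice.")


def self_explanation_baseline(task_prompt: str) -> tuple[str, str]:
    """Run the task, then simply ask the model how it got its answer.

    Returns (answer, explanation). Any fancier interpretability method
    (SAEs, transcoders, ...) should beat this baseline to justify its cost.
    """
    transcript = [{"role": "user", "content": task_prompt}]
    answer = query_model(transcript)

    transcript += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Explain, step by step, how you arrived at that answer."},
    ]
    explanation = query_model(transcript)
    return answer, explanation
```

The interesting scientific question is then how often, and under what conditions, the returned explanation is actually faithful to what the model did.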

It also seems that, if we're trying to use any kind of control or scalable oversight scheme where weak trusted models oversee strong untrusted models, then the better we are at high fidelity communication with the weaker models, the better. And if the model is aligned, I feel much more excited about a world where the widely deployed systems are doing things users understand, rather than being inscrutable autonomous agents.

Naturally, it's worth thinking about negative externalities. In my opinion, helping humans have better models of AI psychology seems robustly good. AIs having better models of human psychology could be good for the reasons above, but there's the obvious concern that it will make models better at being deceptive, and I would be hesitant to recommend that such techniques become standard practice without better solutions to deception. But I expect companies to eventually do things vaguely along the lines of agentic interpretability regardless, so either way I would be keen to see research on how such techniques affect models' propensity and capability for deception.


Agreed. E.g. a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn't want, and intentionally acts in a way such that gradient descent will fix those imperfections. That seems like it's doing gradient hacking while also, in some meaningful sense, being aligned.

OK, I'm going to bow out of the conversation at this point, I'd guess further back and forth won't be too productive. Thanks all!

Ah! Thanks a lot for the explanation, that makes way more sense, and is much weaker than what I thought Ben was arguing for. Yeah, this seems like a pretty reasonable position, especially "take actions where, if everyone else took them, we would be much better off", and I am completely fine with holding Anthropic to that bar. I'm not fully sold on the asking-for-consent framing, but mostly for practical reasons - I think there are many ways in which society is not able to act consistently, and the actions of governments on many issues are not a reflection of the true informed will of the people - but I expect there's some reframe here that I would agree with.

Here’s two lines that I think might cross into being acceptable from my perspective.

I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.

I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differentially improve society's epistemics and awareness of AI's extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I'm not sure.

I of course do not think any current project looks superficially like these.

Okay, after reading this it seems to me that we broadly do agree and are just arguing over price. I'm arguing that it is permissible to try to build a doomsday machine if there are really good reasons to believe that doing so reduces the overall probability of doomsday. It sounds like you agree, and you give two examples of what "really good reasons" could look like. I'm sure we disagree on the boundaries of where the really good reasons lie, but I'm trying to defend the point that you actually need to think about the consequences.

What am I missing? Is it that you think these two are really good reasons, not because of the impact on the consequences, but because of the attitude/framing involved?

And do you feel this way because you believe that the general policy of obeying such deontological prohibitions will on net result in better outcomes? Or because you think that even if there were good reason to believe that following a different policy would lead to better empirical outcomes, your ethics say that you should be deontologically opposed regardless?

I disagree with this as a statement about war. I'm sure a bunch of Nazi soldiers were conscripted, did not particularly support the regime, and were participating out of fear. Similarly, malicious governments have conscripted innocent civilians and kept them in line through fear in many unjust wars throughout history. And even people who volunteered may have done so because extensive propaganda led them to believe they were doing the right thing. The real world is messy, and strict deontological prohibitions break down in complex and high stakes situations where inaction also has terrible consequences - I strongly disagree with a deontological rule that says countries are not allowed to defend themselves against innocent people forced to do terrible things.

Thank you for clarifying, I think I understand now. I'm hearing that you're not arguing in defense of Anthropic's specific plan, but in defense of the claim that some plans involving racing to build something with a (say) >20% chance of causing an extinction-level event are good, and that Anthropic's plan may or may not be among them.

Yes that is correct

This isn’t disanalogous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.

I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of "taking actions that may lead to death", which I think is more analogous - hopefully we can agree Anthropic won't intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone's probability of dying, while introducing some novel risks.

I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.

I think we're talking past each other. I understood you as arguing "deontological rules against X will systematically lead to better consequences than trying to evaluate each situation carefully, because humans are fallible". I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.

I am arguing that "things to do with human extinction from AI, when there's already a meaningful likelihood" are not a domain where ethical prohibitions like "never do things that could lead to human extinction" are productive. For example, you help run LessWrong, which I'd argue has helped raise the salience of AI x-risk, which plausibly has accelerated timelines. I personally think this is outweighed by other effects, but that's via reasoning about the consequences. Your actions and Anthropic's feel more like a difference in scale than a difference in kind.

Assuming this out of the hypothesis space will get you into bad ethical territory.

I am not arguing that AI x-risk is inevitable, in fact I'm arguing the opposite. AI x-risk is both plausible and not inevitable. Actions to reduce this seem very valuable. Actions that do this will often have side effects that increase risk in other ways. In my opinion, this is not sufficient cause to immediately rule them out.

Meanwhile, I would consider anyone pushing hard to make frontier AI to be highly reckless if they were the only one who could cause extinction and they could unilaterally stop - stopping would be a way to unilaterally bring the risk to zero, which is better than any other action. But Anthropic has no such action available, and so I want them to take the actions that reduce risk as much as possible. And there are arguments for proceeding and arguments for stopping.

The point I was trying to make is that, if I understood you correctly, you were appealing to common sense morality to argue that deontological rules like this are good on consequentialist grounds. I was trying to give examples of why I don't think this immediately follows: you need to actually make object level arguments about this and engage with the counterarguments. If you want to argue for deontological rules, you need to justify why those rules lead to better outcomes in this domain.

I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are plans vaguely similar to Anthropic's that I would predict are net good in expectation, e.g., becoming a prominent actor and then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited, and I disagree.

I don't think this follows from naive moral intuition. A crucial disanalogy with murder is that if you don't kill someone, the counterfactual is that the person is alive. While if you don't race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway. This means that we need to be engaging in discussion about the consequences of there being another actor pushing for this, the consequences of other actions this actor may take, and how this all nets out, which I don't feel like you're doing.

I expect AGI to be either the best or worst thing that has ever happened, which means that important actions will typically be high variance, with major positive or negative consequences. Declining to engage in things with the potential for highly negative consequences severely restricts your action space. And given that a terrible outcome is plausible even if we do nothing, I don't think the act-omission distinction applies.
