LESSWRONG
Diogo de Lucena

Chief Scientist at AE Studio

Posts

Mistral Large 2 (123B) seems to exhibit alignment faking (80 karma, 3mo, 4 comments)

Reducing LLM deception at scale with self-other overlap fine-tuning (155 karma, 3mo, 41 comments)

Science advances one funeral at a time (100 karma, 7mo, 9 comments)

Self-prediction acts as an emergent regularizer (91 karma, 8mo, 9 comments)

The case for a negative alignment tax (75 karma, 9mo, 20 comments)

Self-Other Overlap: A Neglected Approach to AI Alignment (223 karma, 10mo, 51 comments)

Video Intro to Guaranteed Safe AI (27 karma, 1y, 0 comments)

AE Studio @ SXSW: We need more AI consciousness research (and further resources) (67 karma, 1y, 8 comments)
