apriiori

on reclaiming ai

stop doing bad philosophy of computer science

On Twitter recently, I encountered the paper Reclaiming AI as a theoretical tool for cognitive science (van Rooij, Guest, Adolfi, de Haan, Kolokolova, and Rich), which I found kinda annoying. It mentions stuff like people having the idea that AGI might replace specifically women1 (???), which seemed weird, but I decided to focus on the computer science aspect. I encourage my readers to form their own opinions by reading the paper if they care, and to let me know if there’s anything interesting outside the computer science part.

Anyways. I’m here because I confused myself a few times thinking I had objections to the proof of their theorem2, and then decided to just write a blog post where I work it out.

Anyways, they specifically prove:

Theorem 1 (Ingenia Theorem):

AI-ʙʏ-Lᴇᴀʀɴɪɴɢ is intractable.

Great. What does this mean? Well, van Rooij et al. seem to think this theorem establishes that producing an AGI through machine learning is completely intractable unless BPP contains NP, or at least that it’s strong evidence of that. And you know, that would be some of the best news I’ve ever heard if it were true, so it would be wonderful if they were right. I would love human-level AI someday, but having to wait for a paradigm shift to something entirely outside machine learning! If only! Even the ultrafast research LLMs of the 2030s, outpacing most human researchers, could very well struggle on the task of figuring out how to build a good old-fashioned AI, and maybe we could actually use the things they learn to build something aligned.

I think they’re probably wrong, though. Let’s get into the mathematics.


So, van Rooij et al. don’t define a decision problem, as someone with only a little experience in complexity theory might be used to, but rather something more akin to a search problem. The goal of AI-ʙʏ-Lᴇᴀʀɴɪɴɢ is to find an algorithm A with a description of length at most K which approximates a distribution 𝓓, if any such algorithm exists. Let’s go through some of the notation and definitions of various sets and such in a little detail.

They have a language, 𝓛3, which is some set containing descriptions L of algorithms A ∈ 𝓐. Each description L has a length, which could be given by an arbitrary4 function from 𝓛 to N, but to be nice I will say that you’ll probably want to imagine 𝓛 as containing strings in some finite alphabet5.

Each algorithm A is a function from {0,1}^n to {0,1}^m, where the input set is called S (for situation) and the output set is called B (for behavior). The idea is that A is supposed to approximate some distribution 𝓓 over S×B, in the sense that for any situation s ∈ S, we want A(s) to try to be a behavior in the set Bs of behaviors that are “appropriate (i.e. humanlike)” for that situation.

Since 𝓓 is meant to be a training distribution for learning human behavior, we will assume that the support of 𝓓 contains only pairs (s, b) where b ∈ Bs. We might be meant to assume the stronger statement that Bs is precisely the set of behaviors b such that (s, b) is in the support of 𝓓, rather than perhaps being a strict superset of it; van Rooij et al. do not quite precisely define the relationship between 𝓓 and the various sets Bs.

The language 𝓛 and the set 𝓐 are required to contain (descriptions of) enough algorithms to at least have an implementation of any Boolean circuit, in order for the NP-Hardness result to go through. So like, chaining together arbitrary AND, OR, and NOT gates. You could have 𝓛 describe an algorithm as an encoding of a Turing machine, to be decoded by some universal Turing machine, though you’d need them to be Turing machines that can give binary string outputs. But 𝓛 can be any set of strings you like, so long as it contains descriptions of algorithms implementing every possible Boolean circuit.
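To make that expressiveness requirement concrete, here’s a minimal sketch of what a description in 𝓛 could look like: my own toy gate-list encoding, not anything from the paper, plus the evaluator that turns a description into an actual function from {0,1}^n to {0,1}^m.

```python
# Toy description language for L: an algorithm is (n, m, gates), where each
# gate is ("AND", i, j), ("OR", i, j), or ("NOT", i), referring to earlier
# wires; wires 0..n-1 are the inputs and the last m wires are the outputs.
# This particular encoding is hypothetical and mine; the paper only requires
# that SOME language this expressive exists.

def evaluate(description, inputs):
    """Run the described circuit on a tuple of input bits."""
    n, m, gates = description
    wires = list(inputs)                    # wires 0..n-1: the input bits
    for gate in gates:
        if gate[0] == "AND":
            wires.append(wires[gate[1]] & wires[gate[2]])
        elif gate[0] == "OR":
            wires.append(wires[gate[1]] | wires[gate[2]])
        elif gate[0] == "NOT":
            wires.append(1 - wires[gate[1]])
    return tuple(wires[-m:])                # last m wires: the output bits

# XOR on two bits: (a OR b) AND NOT (a AND b).
xor = (2, 1, [("OR", 0, 1), ("AND", 0, 1), ("NOT", 3), ("AND", 2, 4)])
assert [evaluate(xor, s) for s in [(0, 0), (0, 1), (1, 0), (1, 1)]] \
       == [(0,), (1,), (1,), (0,)]
```

Chaining gates like this gets you any Boolean circuit, which is all the NP-Hardness argument needs from 𝓛.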

Now we’ve gone over enough notation that we can look at van Rooij et al.’s problem statement:

AI-ʙʏ-Lᴇᴀʀɴɪɴɢ (ꜰᴏʀᴍᴀʟ)

Given: An integer K and a way of sampling from a distribution 𝓓.

Task: Find a description L ∈ 𝓛, with length |L| ≤ K, of an algorithm A ∈ 𝓐 that with probability ≥ δ(n), taken over the randomness in the sampling, satisfies:

$$ \Pr_{s \sim \mathcal{D}_n} [A(s) \in B_s] \geq \dfrac{|B_s|}{|B|} + \varepsilon(n). $$

Here δ(n) and ε(n) are arbitrary non-negligible functions. A function f is non-negligible if there is some d such that for sufficiently large n, f(n) ≥ 1/n^d.
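If you want a feel for the definition, here’s a tiny numeric sanity check, my own illustration, with a finite window of n standing in for “sufficiently large n”: 1/n² is non-negligible, while 2^(-n) fails the bound for every exponent d.

```python
# A numeric illustration of (non-)negligibility. f(n) = 1/n^2 is
# non-negligible (take d = 2), while g(n) = 2^-n falls below 1/n^d for
# EVERY fixed d once n gets large. The finite window is a stand-in for
# "for all sufficiently large n".

def beats_inverse_poly(f, d, window=range(10, 5000)):
    return all(f(n) >= 1.0 / n**d for n in window)

f = lambda n: 1.0 / n**2
g = lambda n: 2.0 ** -n

assert beats_inverse_poly(f, d=2)
assert not any(beats_inverse_poly(g, d) for d in range(1, 20))
```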

So AI-ʙʏ-Lᴇᴀʀɴɪɴɢ challenges one to produce a mechanism which, with probability non-negligible in n, finds an algorithm that outperforms returning random bits with probability non-negligible in n. It isn’t stated perfectly clearly, but I am pretty sure that 𝓓n is meant to be the distribution you get if you sample from 𝓓 and then take only the S part.
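If it helps to see the success criterion as code, here’s a toy Monte Carlo version of it. The distribution, the n = 8, and the parity-based Bs are all made up by me for illustration; chance performance here is |Bs|/|B| = 1/2.

```python
import random

# A toy, Monte Carlo version of the success criterion
# Pr_{s ~ D_n}[A(s) in B_s]. The distribution D here is invented by me:
# n = 8 input bits, m = 1 output bit, and the unique appropriate behavior
# for situation s is the parity of s.

random.seed(0)
n = 8

def sample_D():
    s = tuple(random.randint(0, 1) for _ in range(n))
    return s, (sum(s) % 2,)                  # (situation, behavior) pair

def success_rate(A, trials=10_000):
    return sum(A(s) == b for s, b in (sample_D() for _ in range(trials))) / trials

perfect = lambda s: (sum(s) % 2,)            # always lands in B_s
coin    = lambda s: (random.randint(0, 1),)  # pure chance

assert success_rate(perfect) == 1.0
assert abs(success_rate(coin) - 0.5) < 0.05  # seeded, so reproducible
```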

Now. We are told that this is going to be shown to be NP-Hard. If you are, like me, the kind of person who is pretty sharp at formal math, but has only worked with this sort of thing formally in one college computability and complexity course, and so doesn’t have enough conventions memorized to tell everything at a glance, one of your first questions when looking at this will be… what on Earth is the time complexity of this problem supposed to be a function of, exactly? An instance of the AI-ʙʏ-Lᴇᴀʀɴɪɴɢ problem is a string of some length N, and that string contains… what, exactly? How big does N have to be as a function of all the pieces of data relevant to the problem, at least as a loose approximation?

Now, for Hirahara’s Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ, the NP-Hard problem they reduce to AI-ʙʏ-Lᴇᴀʀɴɪɴɢ, I was entirely able to figure this out. Hirahara’s problem takes, if we use the van Rooij et al. notation for consistency, an input consisting of an integer k expressed in binary, and a circuit C expressed in some unspecified language which probably has some small number of characters for each input node, output node, wire, and gate in the circuit. I understand this input format pretty well.

So I’m pretty sure that the input length for Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ is on the order of log(k) + log(|V|)*(|V| + 2|E|), where the logarithms are base two, |V| is the number of vertices (representing nodes and gates) in a graph of C, and |E| is the number of edges (representing wires). You need log(k) bits to describe k, log(|V|) bits to put a label on each of |V| vertices, 2log(|V|) bits to describe each of |E| ordered pairs of vertices, and some extra to describe stuff like which type of logic gate a given vertex represents6.
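For whatever it’s worth, that accounting is easy enough to turn into a little helper (mine, obviously, and only up to constant factors and the gate-type tags):

```python
from math import ceil, log2

# The bit-count estimate from the text as a helper, my own accounting:
# log(k) bits for k, a log|V|-bit label per vertex, and two log|V|-bit
# endpoints per edge. Gate-type tags and other small stuff are omitted.

def instance_bits(k, num_vertices, num_edges):
    label = ceil(log2(num_vertices))
    return ceil(log2(k)) + label * (num_vertices + 2 * num_edges)

# A circuit with 1000 vertices and 1500 wires, and k = 2^20:
assert instance_bits(k=2**20, num_vertices=1000, num_edges=1500) \
       == 20 + 10 * (1000 + 2 * 1500)
```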

For AI-ʙʏ-Lᴇᴀʀɴɪɴɢ? Well, if you look at the problem statement, it says it takes as input an integer K and a “way of sampling from” a distribution 𝓓. The integer is straightforward, though you could perhaps express it in unary if you wanted to. What is “a way of sampling from” a distribution?

Western Ways of Sampling

Well, okay, they use the same language when they define Hirahara’s Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ, which definitely uses a circuit as a way of sampling from a distribution, because Hirahara explains that very clearly in a footnote.

But van Rooij et al.’s NP-Hardness proof has a curious passage:

Every time that the mechanism M asks for a data sample (s, b) consisting of a situation s ∈ S with a corresponding appropriate behavior b ∈ Bs, the simulation of M produces a random string r that is fed as input to the circuit C, resulting in an element (s, b) ∈ {0, 1}^n × {0, 1}, which is given to the mechanism M as data sample.

This sounds to me sort of like an oracle sort of situation, where we treat requesting a new sample from 𝓓 as a sort of black-box operation that could be implemented by pretty much any method—though I suppose the method better be no worse than polynomial time in N, whatever N is. And in this particular case, we implement it in the same way that Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ has for sampling from its distribution.

For more clarity, let us check what van Rooij et al. write about 𝓓 when introducing it earlier on:

In our hypothetical scenario, we will be granting computationalism; i.e., we will not be challenging the assumption that cognition is a form of computation.

A worrying start.

Hence, Dr. Ingenia can safely assume that 𝓓 is generated by a computational process, i.e., either by the computational processes that define a single individual’s cognition, or the cognition of a finite collection of human beings.

Well, fantastic. I’m happy to at least know that 𝓓 is computable by a probabilistic Turing machine or whatnot. I hope that the computational processes which define the cognition of a finite collection of human beings can execute in polynomial time.

Maybe you sort of want to think of this in terms of sample complexity. That is, you try to keep the number of samples from 𝓓 polynomial, and then you just don’t really care how involved the sampling process is. Our mechanism pauses while waiting for a new sample and does zero operations during it.

Do we have something like an idea of what a “way of sampling” is, now? Do you understand ways of sampling well enough to do mathematics to them?

In that case, here’s the five dollar question: how many bits are in a way of sampling? The input corresponding to an instance of AI-ʙʏ-Lᴇᴀʀɴɪɴɢ consists of a natural number K, and a way of sampling.

We will consider various possible answers to this question.

Taking the long way

We have an upper bound, of a sort, for how long a way of sampling is. Because we know that the hypothetical polynomial time mechanism M used in the proof by reduction must run in polynomial time as a function of the input to Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ, we can say something about the specific way of sampling wherein the program simulating you passes a random bit string through a circuit and then hands you the output: the encoding of that particular way of sampling must be no more than polynomial in the length of the circuit’s description. Otherwise, our “polynomial time” mechanism M would be too slow to be used by an algorithm solving Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ.

Unfortunately, this doesn’t tell us very much about ways of sampling for distributions 𝓓 that are “generated by either the computational processes that define a single individual’s cognition, or the cognition of a finite collection of human beings”.

Since we’re left unsure about this point, let’s just do some analysis of the runtime in terms of various inputs. We will begin by establishing that a way of sampling should probably contribute some length to the input at all.

Theorem 2.1: Suppose 𝓛 consists of strings in some alphabet. Then AI-ʙʏ-Lᴇᴀʀɴɪɴɢ takes at least Ω(K) = Ω(exp(log(K))) time.

Let 𝓓 be a distribution which can be approximated by an algorithm of length K, but not by any algorithm of a length that grows sublinearly with K. Such a 𝓓 may not exist for all possible values of K, but I am pretty sure that it happens at least occasionally for arbitrarily large values of K, thanks to the capacity of 𝓛 to describe an algorithm implementing any Boolean circuit. In this case, the required step of writing the output down takes Ω(K) time on its own.

So if we can just choose any 𝓓 at all we want at no cost, then the intractability of AI-ʙʏ-Lᴇᴀʀɴɪɴɢ is absurdly trivial and simply follows from the fact that we have to write down the answer, which might be long. So presumably large choices of n and m do somehow contribute to the length of the related way of sampling.

In certain limited circumstances, the O(K) bound can actually be achieved:

Proposition 2.2: If we restrict to the case where K grows no faster than logarithmic in n, then AI-ʙʏ-Lᴇᴀʀɴɪɴɢ can be solved by a probabilistic mechanism M in time linear in K.

Consider the mechanism which immediately returns K random characters, each drawn from a uniform distribution over the alphabet. Then any given algorithm A of length K will be returned with probability 1/(c^K), where c is the number of distinct characters in the alphabet. This probability is lower bounded by 1/(c^(a*log(n))) = 1/(n^(a*log(c))) for a constant a, which is non-negligible. Thus, if any algorithm A of length K approximates 𝓓, it has non-negligible probability of being output. Also, I’m pretty sure adding in the algorithms of length less than K doesn’t make a big difference; the probability is still non-negligible if you need 2K bits.

Now, technically, the probability is supposed to be over the randomness from the sampling. And certainly you could try to use that to generate a random candidate description of an algorithm. It might add more than linear in K overhead, but still easily polynomial in K. Regardless, I hope my version is enough to convince that, for the interesting cases of AI-ʙʏ-Lᴇᴀʀɴɪɴɢ, K ought to be growing faster than logarithmic in n.
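Here’s the Proposition 2.2 mechanism as code, roughly: ignore the samples entirely and spit out K uniformly random characters. The binary alphabet and the names here are my own choices.

```python
import random

# Proposition 2.2's mechanism, sketched: ignore every sample and just emit
# K uniformly random characters. If K <= a*log2(n), any one fixed length-K
# description appears with probability 1/c^K >= 1/n^(a*log2(c)).
# The alphabet is a choice of mine, for illustration.

random.seed(1)
ALPHABET = "01"                               # so c = 2

def guess_mechanism(K):
    return "".join(random.choice(ALPHABET) for _ in range(K))

def hit_probability(K, c=len(ALPHABET)):
    return c ** -K          # chance of emitting one particular length-K string

n = 1024
K = 3 * 10                  # K = 3 * log2(n)
assert hit_probability(K) == 1 / n**3         # exactly 1/n^3: non-negligible
assert len(guess_mechanism(K)) == K
```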

Next, let’s consider a certain questionable choice of 𝓛.

Theorem 2.3: There is at least one language 𝓛 such that AI-ʙʏ-Lᴇᴀʀɴɪɴɢ is Θ(2^n m).

Suppose that 𝓛 encodes an algorithm A with n input bits and m output bits by writing out a lookup table corresponding to that algorithm, such as by concatenating together the m bit output strings for each of the 2^n situations into a big (2^n m) bit string—much as in Hirahara’s Theorem 2.2.

First, observe that in this language, every algorithm which could possibly approximate 𝓓 is encoded in exactly 2^n m bits. So the behavior of our mechanism does not matter if K < 2^n m, but it must produce an algorithm approximating 𝓓 if K ≥ 2^n m.

Since producing an output will require writing m*2^n bits, it is impossible to do better than Ω(2^n m) time. Thus, we will freely use up to O(2^n m) steps ourselves.

We wish to, with non-negligible probability δ(n)7, observe a non-negligible portion ε(n) of situations s, where each s is weighted by Pr[(s’, b’) = (s, b’)] for a sample (s’, b’) drawn from 𝓓.

Working through the probability theory in full formal detail would be slightly annoying, but I hope you don’t find it too difficult to believe that drawing 2^n * Θ(ε) samples really ought to suffice. In the worst case, a uniform distribution, you expect to see at least (1-ε)/2^n more novel probability mass each sample if you haven’t seen ε probability mass yet. And 2^n * Θ(ε) is surely, surely large enough that the law of large numbers ought to be kicking in even if I don’t feel like proving a formal statement about the convergence rate.

More specifically, we have:

$$ \dfrac{\varepsilon}{1-\varepsilon} \cdot 2^n \cdot \dfrac{1-\varepsilon}{2^n} \approx \varepsilon $$

And I’m pretty sure ε/(1-ε) shrinks only very slightly slower than ε does.
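For the uniform case you can even check the expectation exactly rather than waving at the law of large numbers: after T independent uniform samples over 2^n situations, the expected probability mass seen is exactly 1 - (1 - 1/2^n)^T, and plugging in T ≈ (ε/(1-ε))·2^n lands at least ε.

```python
from math import ceil

# Exact-expectation check for the uniform worst case: after T i.i.d.
# uniform samples over 2^n situations, the expected probability mass of
# the situations seen so far is 1 - (1 - 1/2^n)^T.

def expected_mass_seen(n, T):
    N = 2 ** n
    return 1.0 - (1.0 - 1.0 / N) ** T

# T ~ (eps / (1 - eps)) * 2^n samples already see at least eps of the mass.
for n in (10, 16, 20):
    for eps in (0.01, 0.1, 0.25):
        T = ceil(eps / (1.0 - eps) * 2 ** n)
        assert expected_mass_seen(n, T) >= eps
```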

Anyways, we’re happy to take our 2^n * Θ(ε) samples. Receiving a sample should take n + m time, so we’ve used up no more than O(2^n m) time8.

Next, we’re going to sort all our samples. If you know a little about sorting, you’ll think that this should necessarily require Ω(n2^n) time, a disaster. If you know more about sorting, you know it’s fine. The loglinear bound only applies to sorting algorithms that rely on pairwise comparisons; we’re sorting bit strings, which are basically integers.

Pigeonhole sort is O(j + N), where j is the number of objects being sorted and N is the largest possible key value (the sorting literature’s variables, not our n and N). I’m pretty sure that’s O(j + N) on some sort of RAM machine, but we can afford to run back and forth across O(ε 2^n m) bits of memory O(n) times, so we ought to be fine. If we notice collisions of data points with the same situation s, we can just forget one of them.

Now it’s a simple matter to write out a final output. We go through all the situations, write down one of the behaviors we saw in that situation if we’ve seen the situation, and otherwise write down a bunch of zeroes or anything else we please. This is just writing out (2^n m) bits, while reading fewer than that. So this is Θ(2^n m).
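The whole mechanism of Theorem 2.3, in toy form (tiny n and m, names mine): bucket the samples by situation key with no comparisons, then write out the full lookup table, defaulting unseen situations to zeroes.

```python
# Theorem 2.3's mechanism as a toy sketch: bucket the samples by situation
# key, comparison-free, then write out the whole 2^n-row lookup table, with
# zeroes for situations we never saw. Sizes here are tiny for illustration.

n, m = 4, 2
samples = [(0b0011, (1, 0)), (0b1111, (0, 1)), (0b0011, (1, 0))]

# Pigeonhole step: one slot per possible key, O(len(samples) + 2^n).
buckets = [None] * (2 ** n)
for s, b in samples:
    buckets[s] = b      # same situation twice: keep one, forget the rest

# Final description: one m-bit behavior per situation, i.e. 2^n * m bits.
table = [buckets[s] if buckets[s] is not None else (0,) * m
         for s in range(2 ** n)]

assert len(table) == 2 ** n
assert table[0b0011] == (1, 0) and table[0b1111] == (0, 1)
assert table[0b0000] == (0, 0)
```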

I hope that Theorem 2.3 can help demonstrate a little about the game that AI-ʙʏ-Lᴇᴀʀɴɪɴɢ is making us play. If you want to learn to approximate a totally arbitrary function non-negligibly well, then you’re going to have to see a non-negligible portion of the information defining the function, which is obviously going to be exponential in n. This is obvious on just, like, a combinatorial level. The idea of doing a proof by reduction here is incredibly silly; why would you do that?

In general, finding a good approximation is simply going to require samples exponential in n. You cannot learn an arbitrary function of n variables in time polynomial in n, that is simply ridiculous, no proof by reduction is needed, and this has pretty much nothing to do with AGI.

Nonetheless, I am not satisfied yet. I think van Rooij et al. were far too loose with their description of 𝓛. In particular, Hirahara says:

Kolmogorov observed that representing a function by a program is the most succinct way of representing a function by any algorithm up to an O(1) additive term. In particular, programs can represent a function more succinctly than circuits.

On the other hand, van Rooij et al. say:

We will add some minimal constraints on 𝓛. We exclude classes of trivial algorithms that have no chance of capturing human-like or -level cognition. Specifically, we impose the constraint that 𝓛 can minimally express feedforward neural networks, logical circuits, or finite state machine-equivalent class of algorithms.

So it is very plausible that the NP-Hardness result for Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ relies on some implicit assertion about 𝓛 being, like, sane.

In fact, it's unclear how to even formalize Hirahara’s Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ in any language that isn’t expressive enough to include Turing machines—it talks about polynomial time programs, and an individual circuit does not have an asymptotic time complexity.

I could attempt to actually describe a language 𝓛 pathological enough for the proof to fail, to check if it’s actually possible, but that sounds like a lot of work. Exercise for the reader.

  1. To be absolutely clear, I think (in agreement with the authors of the paper (presumably)) that AGI replacing specifically women is absurd, that powerful people trying to use AGI to replace specifically women would be bad, and that anyone who thinks the gender of a laborer is a very good indicator of how easily their job can be automated must be not only sexist, but also an idiot.

    My objection is that this is a weak man, and it’s the sort of weak man that suggests to me that the paper is not taking AI remotely seriously.

  2. First objection: Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ as stated by van Rooij et al. was not in fact proven to be NP-Hard in Hirahara 2022. van Rooij et al. define Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ with the Nᴏ case requiring only that there be no program of length k which can predict the distribution 𝓓 non-negligibly better than chance, while Hirahara’s NP-Hardness result (Theorem 1.1) is about a similar problem where the Nᴏ case is actually a stronger condition, requiring that no program of length as high as k*n^ε non-negligibly approximates 𝓓, where

    $$ \varepsilon = \dfrac{1}{(\log \log n)^{O(1)}} $$

    in which O(1) is probably just some unspecified natural-number constant. In theory it could be a weird oscillating function of n, so long as that function has upper and lower bounds, but I’m pretty sure it’s gonna be a constant.

    If I’m figuring this correctly, k*n^ε grows slower than any polynomial, but faster than logarithmic. So as n increases, the program size being ruled out by Hirahara’s problem goes to infinity slowly, but not logarithmically slowly. This is incredibly different from Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ's constant k.

    Is the NP-Hardness of Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ a trivial corollary of Hirahara’s result? Well, Hirahara seems more interested in what happens if you make the discrepancy larger—if NP-Hardness were shown for the version with k*1.01n, this would apparently disprove Impagliazzo’s Heuristica! Does that mean the version with just k, Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ, is easy?

    As far as I can figure… no, it’s at least kinda hard to show, you now have to contend with cases where the smallest good-enough program is of a size like k+1, which sounds like it might be hard to distinguish from the case where it’s size k? I dunno.

    Anyways, it is true that their proof seems to reduce AI-ʙʏ-Lᴇᴀʀɴɪɴɢ to a problem that hasn’t actually been proven to be NP-Hard. But it doesn’t matter at all, because their reduction method would work equally well on Hirahara’s actual problem.

    Second objection: Wow this reduction sure runs M a lot of times, and then for every single time it runs M it sure runs the output algorithm A a whole lot of times. Are we sure that’s all polynomial time in the input length of Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ, which we’ll call N?

    So, van Rooij et al. describe n as one of the inputs to Pᴇʀꜰᴇᴄᴛ-ᴠꜱ-Cʜᴀɴᴄᴇ. This threw me off a little: Hirahara just leaves n to be inferred from the circuit C, and if I’d paid attention to how exactly the circuit is fed as an input I’d’ve noticed my mistake sooner.

    My specific worry was that, if n is only encoded in binary, then algorithms polynomial in n could be exponential in N. So it wouldn't be a polynomial time reduction, which is necessary for the NP-Hardness proof to work.

    But, the way a circuit is typically defined, a circuit that takes n bits as input will take at minimum n length to describe. Like, it’s a DAG where the vertices with no incoming edges are the input bits, the vertices with no outgoing edges are the output bits, and all the intermediate vertices are labelled with logic gates (generally the set used is AND, OR, and NOT). And a graph used in a complexity problem is going to have to list out every vertex… so there’s essentially an implicit unary encoding of n in the string describing C. The reduction is totally fine and I was slightly being an idiot.

    I guess you could just ignore almost all of the hypothetical input bits, and not put them in the graph, but… no, look, that’s just dumb. If you want to encode a circuit that ignores an input bit, then put the input bit there as a floating vertex.

  3. They subscript some of these symbols but Substack makes subscripting annoying for no good reason and also the subscripts don’t really convey important information because they’re never talking about multiple distinct sets of algorithms at once or anything. I’m not bothering.

  4. You probably don’t want finding the length of a description L to take, like, an exponentially long time, or anything like that.

  5. And so checking if a string is of length K or less is logarithmic in K.

  6. You can probably compress that down a little if you want, but you’re never going to do better using a really simple encoding scheme than, to be incredibly optimistic about how much you might shave off, halving the input length. And that’s just not relevant to asymptotic runtime.

    There are round about k+2^(V^2/2) different pairs of numbers less than k and directed acyclic graphs with V vertices. So, hard lower bound, you simply can’t do any better than log(k+2^(V^2/2)). Which is basically V^2 and change, except for incredibly silly problem instances where k is exponential in V.

    …Aren’t galaxy-brain large encoding schemes what you’d need to do to cheat, anyways?

  7. It doesn’t really hurt here to assume that δ is even something like 1/2. For non-uniform distributions it could end up helping, but on the uniform distribution there’s no way to get really lucky and grab a bunch of probability mass really quick.

  8. Assuming we have ε decrease faster than 1/n.