
LED — how a neural network worked without backpropagation

· #research
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌

For several months I have been working on a biologically plausible neural network that learns Uzbek at the character level. In v35 I removed backpropagation entirely, and unexpectedly the model outperformed BP on the small tasks. In this post I describe openly how the road went, what I did, and why it worked.

Project: samantha-v5 / v35
Code: char-level UZ language model, trained on Mac CPU

TL;DR — in one sentence

Without ever calling loss.backward(), I trained the network using only local rules (DFA random feedback + Hebbian update + truly binary LIF spike + sparse 80/20 connectivity + cosine lr). On word3 it reached 0.585 vs 0.543 for BP, and on word4 0.563 vs 0.533, so on these small tasks the biologically plausible model surpassed backpropagation. On the sentence task, however, BP was 2.9× higher, which suggests the small-task wins were a ceiling artifact.

Why is backprop a problem?

Standard deep learning relies on the chain rule:

∂L/∂W_i = (∂L/∂h_n) · (∂h_n/∂h_{n-1}) · ... · (∂h_{i+1}/∂W_i)

This is mathematically perfect, but not biologically plausible for three reasons:

  1. Weight transport problem — the backward pass needs W^T. A real synapse is not bidirectional; an axon sends signal in one direction only.
  2. Update locking — each layer’s update depends on signal coming from later layers. Real cortex does not work serially like that.
  3. Symmetric feedback — the same weight is required on the forward and backward paths. This does not exist in a real brain.
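
To make the weight transport problem concrete, here is a minimal sketch in plain Python (no PyTorch). It compares the hidden-layer error that backprop would compute, which needs the transpose of the forward weights, with an error sent through a fixed random matrix. The tiny sizes and the names `W2`, `B`, `matvec`, and `transpose` are illustrative assumptions and are not taken from the project code.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]

rng = random.Random(0)
n_hidden, n_out = 4, 3

# Forward weights of the output layer, shape (n_out, n_hidden).
W2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
# Fixed random feedback matrix, shape (n_hidden, n_out); never trained.
B = [[rng.gauss(0, 1) for _ in range(n_out)] for _ in range(n_hidden)]

e_out = [0.2, -0.5, 0.3]  # an example output error (softmax - one_hot)

# Backprop: the hidden error reuses the forward weights, transposed.
e_hidden_bp = matvec(transpose(W2), e_out)
# Random feedback: the hidden error uses B, which is independent of W2.
e_hidden_fb = matvec(B, e_out)

print(e_hidden_bp)
print(e_hidden_fb)
```

The two vectors differ, and only the first one requires the backward path to know the forward weights.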

A neuron in real cortex only knows local information: its own input, its own output, and possibly a global modulator (dopamine/norepinephrine). There is no “chain rule” anywhere.

Existing alternatives — and their limitations

Over the years researchers have searched for BP alternatives:

Algorithm                   Author, year     Bio score   Limitation
Feedback Alignment          Lillicrap 2016   6/10        Returns signal layer-by-layer
Direct Feedback Alignment   Nøkland 2016     7/10        Random matrices feel abstract
Forward-Forward             Hinton 2022      7/10        Toy tasks, hard to scale
Equilibrium Propagation     Scellier 2017    7/10        Needs convergence to energy minimum
Predictive Coding           Rao 1999         8/10        Equivalent to BP under certain conditions
e-prop                      Bellec 2020      8/10        For SNNs, needs a global modulator

Most of these reach 60-80% of BP on small tasks. To my knowledge, none has been tried as part of a “combo”.

v1-v34 — earlier attempts

Before arriving at LED I had tried plenty:

  • v2-v3: Auxiliary classifier per layer (local CE loss + Hebbian) — worked (~0.88), but each layer still called loss.backward(). This is not really local — it’s hidden BP.
  • v3: STDP alone (no aux loss) — plateau at 0.21 (chance ~0.10). STDP is pure correlation; there is no classification signal.
  • v4-v5: R-STDP + reward-modulated STDP — plateau at 0.20 in the purely biological setting. A global reward R(t) does not solve deep credit assignment.
  • v8: STDP-learned V1 filters + BP classifier — V1 is local, the rest is BP. A half-solution.
  • v18-v34: BP was largely retained. v34 with SpikeGPT integration showed a record 0.774 (sentence), but it was LIF surrogate gradient + BP.

No earlier version reached BP-level performance on a multi-layer, supervised task without true BP.

How the LED idea came about

After v34 I asked myself an honest question:

“Not a model that beats BP — but can I build a biological model that gives good results?”

My core intuition went like this:

“Neurons compute and store their local error, then each neuron punishes itself based on its own error. It looks at its neighbor and informs it about the error too.”

This is the combination of three ideas:

  1. State recording — on the forward pass each layer stores its input and pre-activation.
  2. Direct Feedback Alignment — the output error is sent directly to each layer through random fixed matrices.
  3. Lateral diffusion — each neuron’s error flows to its neighbors through a 1D conv-style filter.

I called this LED — Lateral Error Diffusion.
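
The post does not show the body of the lateral diffusion step, so the following is only a guess at its simplest form: a 3-tap filter along a 1D line of neurons, where `alpha` is the share of the error taken from the two neighbors. The filter shape and the edge handling are assumptions. The one property taken from the ablation table is that alpha = 0 reduces to plain DFA.

```python
def lateral_diffuse(e, alpha):
    """Mix each neuron's error with its two neighbors along a 1D line.

    alpha = 0 returns the error unchanged (plain DFA); a larger alpha
    moves more weight onto the neighbors. Edge neurons reuse their own
    value in place of the missing neighbor.
    """
    n = len(e)
    out = []
    for j in range(n):
        left = e[j - 1] if j > 0 else e[j]
        right = e[j + 1] if j < n - 1 else e[j]
        out.append((1.0 - alpha) * e[j] + 0.5 * alpha * (left + right))
    return out

e_h = [0.0, 0.0, 1.0, 0.0, 0.0]      # a single neuron carries all the error
print(lateral_diffuse(e_h, 0.0))     # unchanged
print(lateral_diffuse(e_h, 0.2))     # part of the error moves to the neighbors
```

With this filter an isolated error keeps its total mass and is only smeared sideways.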

Algorithm — full

Forward pass

x_0 = Embedding(idx)                    # (B, T, n_embd)

for layer i = 1 to L:
    pre_i = SparseLinear_i(x_{i-1})     # the sparse mask is applied here
    save: input_i = x_{i-1}, pre_i

    # LIF dynamics (for each timestep t):
    for t = 0 to T-1:
        V_i[t] = α · V_i[t-1] + pre_i[t]
        s_i[t] = (V_i[t] > θ).float()                # TRULY BINARY
        pseudo_i[t] = 1 / (1 + (β·(V_i[t] - θ))²)    # ATan derivative
        V_i[t] = V_i[t] - s_i[t] · θ                 # soft reset

    save: spike_i, pseudo_i
    x_i = x_{i-1} + s_i                  # residual

logits = Linear_head(x_L)

Important: all stored tensors are .detach()-ed — they do not enter the graph, no gradient is computed.

Local update rule (LED update)

Instead of backward():

@torch.no_grad()
def led_update(self, logits, targets, mask, lr):
    # 1) Output error: the most beautiful local signal
    e_out = softmax(logits) - one_hot(targets)

    # 2) Output head: pure Hebbian
    dW_head = e_out.T @ pre_head / N
    head_w -= lr * dW_head

    # 3) Hidden layers: DFA + LIF pseudo-grad
    for i in range(L, 0, -1):             # layers L down to 1
        e_h = e_out @ B_i.T               # DFA random feedback
        e_h *= pseudo_i                    # LIF derivative
        e_h = lateral_diffuse(e_h, α)     # diffusion to neighbors

        dW = e_h.T @ input_i / N
        dW *= mask_i                       # SPARSE: unconnected = 0
        W_i -= lr * dW

    # 4) Embedding update
    e_emb = e_out @ B_0.T * pseudo_0
    e_emb_proj = e_emb @ (W_0 * mask_0)
    emb[id] -= lr * scatter(e_emb_proj)

B_i are fixed random matrices that are not trained. They are randomly initialized once at startup and never change after that.
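
The update above is schematic, so here is a small runnable sketch of the same idea in plain Python: one hidden layer trained with a fixed random feedback matrix and no backward pass. It is a simplification under stated assumptions: a tanh hidden unit stands in for the LIF cell (so the local derivative is 1 - h² instead of the ATan pseudo-derivative), there is no sparse mask and no lateral diffusion, and the two-sample toy task, the sizes, and all names are illustrative.

```python
import math
import random

rng = random.Random(0)
n_in, n_hid, n_out = 2, 4, 2
lr = 0.1

W1 = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hid)]  # trained
W2 = [[0.0] * n_hid for _ in range(n_out)]                             # trained head
B = [[rng.gauss(0, 1.0) for _ in range(n_out)] for _ in range(n_hid)]  # fixed feedback
B_initial = [row[:] for row in B]

data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]  # toy two-class task (assumption)

def softmax(z):
    m = max(z)
    ex = [math.exp(v - m) for v in z]
    s = sum(ex)
    return [v / s for v in ex]

def forward(x):
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(a) for a in pre]
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return h, softmax(logits)

def mean_loss():
    return sum(-math.log(forward(x)[1][y]) for x, y in data) / len(data)

def dfa_step(x, y):
    h, p = forward(x)
    e_out = [p[k] - (1.0 if k == y else 0.0) for k in range(n_out)]
    # Hidden error: the output error sent through the fixed matrix B,
    # gated by the local activation derivative (tanh' = 1 - h^2).
    e_h = [sum(B[j][k] * e_out[k] for k in range(n_out)) * (1.0 - h[j] ** 2)
           for j in range(n_hid)]
    # Output head: error times presynaptic activity.
    for k in range(n_out):
        for j in range(n_hid):
            W2[k][j] -= lr * e_out[k] * h[j]
    # Hidden layer: local error times presynaptic input.
    for j in range(n_hid):
        for i in range(n_in):
            W1[j][i] -= lr * e_h[j] * x[i]

loss_before = mean_loss()
for step in range(500):
    x, y = data[step % len(data)]
    dfa_step(x, y)
loss_after = mean_loss()
print(loss_before, loss_after)
```

The loss falls even though `W2` is never transposed and `B` is never updated, which is the property the LED update relies on.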

Sparse 80/20 connectivity

In real cortex, neurons connect heavily with their nearby neighbors and sparsely with distant ones. How can this be modeled?

def build_sparse_mask(n_in, n_out, local_frac=0.8, sigma_local=2.0,
                      density=0.15, seed=0):
    pos_in = make_2d_positions(n_in)        # (n_in, 2) grid
    pos_out = make_2d_positions(n_out)
    dist = (pos_out[:, None] - pos_in[None, :]).norm(dim=-1)

    n_per_out = int(density * n_in)
    n_local = int(n_per_out * local_frac)   # 80%
    n_distant = n_per_out - n_local          # 20%

    # Local: the nearest neurons under a Gaussian score
    local_logits = -dist**2 / (2 * sigma_local**2)
    _, local_idx = local_logits.topk(n_local, dim=-1)

    # Distant: random non-local neurons
    mask = scatter(local_idx) | random_distant(n_distant)
    return mask  # bool (n_out, n_in)

The mask is stored as a fixed buffer and multiplied elementwise with the weight. A connection where mask=0 never learns, which is treated here as a biological invariant.
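
The helpers `make_2d_positions`, `scatter`, and `random_distant` are not shown in the post, so here is a self-contained sketch of how such a mask could be built in plain Python. The grid layout (row by row, scaled to the unit square so layers of different sizes overlap) is my assumption. `sigma_local` is left out because taking the top-k of a Gaussian score selects the same inputs as taking the k nearest.

```python
import math
import random

def make_2d_positions(n):
    """Place n neurons row by row on a square grid scaled to [0, 1] x [0, 1]."""
    side = math.ceil(math.sqrt(n))
    scale = max(side - 1, 1)
    return [((k % side) / scale, (k // side) / scale) for k in range(n)]

def build_sparse_mask(n_in, n_out, local_frac=0.8, density=0.15, seed=0):
    rng = random.Random(seed)
    pos_in, pos_out = make_2d_positions(n_in), make_2d_positions(n_out)
    n_per_out = int(density * n_in)          # incoming connections per output
    n_local = int(n_per_out * local_frac)    # nearest inputs
    n_distant = n_per_out - n_local          # random far inputs
    mask = []
    for po in pos_out:
        # Input indices sorted by distance to this output neuron.
        order = sorted(range(n_in), key=lambda j: math.dist(po, pos_in[j]))
        chosen = set(order[:n_local])
        chosen.update(rng.sample(order[n_local:], n_distant))
        mask.append([j in chosen for j in range(n_in)])
    return mask  # n_out rows, each a list of n_in booleans

mask = build_sparse_mask(n_in=64, n_out=128)
print(sum(mask[0]), "of", len(mask[0]), "inputs connected per output neuron")
```

With density=0.15 and 64 inputs, every output neuron gets 9 incoming connections: 7 local and 2 distant.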

LIF cell — truly binary spike

import torch
import torch.nn as nn

class LIFCell(nn.Module):
    def __init__(self, tau=2.0, threshold=1.0, alpha=2.0):
        super().__init__()
        self.decay = 1.0 - 1.0 / tau    # membrane decay: 0.5 for tau=2
        self.threshold = threshold
        self.alpha = alpha              # sharpness of the pseudo-derivative (β in the forward pass above)
        self.V = None                   # membrane potential, created on first call

    def forward(self, I):
        if self.V is None:
            self.V = torch.zeros_like(I)                # start from rest
        self.V = self.decay * self.V + I
        spike = (self.V > self.threshold).float()       # TRULY BINARY
        x = self.alpha * (self.V - self.threshold)
        pseudo_grad = 1.0 / (1.0 + x * x)               # ATan derivative
        self.V = self.V - spike * self.threshold        # soft reset
        self.V = self.V.detach()
        return spike, pseudo_grad

The most critical line is spike = (V > thr).float(). This step function has zero derivative almost everywhere and is undefined at the threshold, so under backprop no useful gradient can pass through it. But we are not using backprop! We apply pseudo_grad by hand inside the LED update, which is very close to Bellec (2020) e-prop.
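
To see the dynamics without PyTorch, here is a scalar sketch of the same cell driven by a constant input current. It uses the same update equations and default constants as the class above. The function name `lif_run` and the input value 0.6 are illustrative assumptions.

```python
def lif_run(currents, tau=2.0, threshold=1.0, alpha=2.0):
    """Simulate one LIF neuron; return (spikes, pseudo_grads) per timestep."""
    decay = 1.0 - 1.0 / tau
    v = 0.0
    spikes, pseudo = [], []
    for i_t in currents:
        v = decay * v + i_t                   # leaky integration
        s = 1.0 if v > threshold else 0.0     # hard threshold, binary output
        x = alpha * (v - threshold)
        pseudo.append(1.0 / (1.0 + x * x))    # largest when v is near threshold
        v -= s * threshold                    # soft reset
        spikes.append(s)
    return spikes, pseudo

spikes, pseudo = lif_run([0.6] * 9)
print(spikes)
print(sum(spikes) / len(spikes))  # spike rate over the window
```

With this input the neuron fires on every third step. The pseudo-derivative stays strictly positive on the silent steps too, so the LED update still has a nonzero local credit signal there.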

Experiments — step by step

Stage 1 — LED alone (few epochs)

Task                       LED     BP      LED/BP
Letter (33 alphabet)       1.000   1.000   EQUAL
Word3 (28 3-token words)   0.457   0.543   84%
Word4 (26 4-token words)   0.383   0.533   72%

LED reaches 72-84% of BP on the word tasks, roughly the expected result (Lillicrap 2016).

Lateral diffusion ablation (which I had the most faith in):

α                Word3 acc
0.0 (pure DFA)   0.457
0.2 (default)    0.457
0.5 (strong)     0.432

Lateral diffusion had no noticeable effect. This was the first honest warning sign: my “spread to the neighbor” idea did not work in this configuration.

Stage 2 — Sparse 80/20 alone (with BP)

Task       Dense   Sparse 80/20   Active params
Word3      0.543   0.543          19.8K (vs 47.7K)
Word4      0.533   0.533          19.8K
Sentence   0.243   0.246          39.3K

Sparse equals dense, with 2.4× fewer parameters. Even 5% density is enough (0.543).

local_frac ablation:

local_frac               Acc
0.0 (pure random)        0.543
0.5                      0.543
0.8 (biological 80/20)   0.543
1.0 (purely local)       0.543

All equal — at this size the 80/20 idea did not make a difference. But lf=1.0 (purely local) came out bad in the later LED+sparse run (0.420). So distant connections are needed, but the exact ratio did not matter.

Stage 3 — the long-training discovery

Initially when I tried LED at 25-40 epochs it lagged behind BP. A user suggestion: train it long. 200-250 epochs + cosine lr decay:

Task       LED best ep   BP best ep   LED final   BP final
Word3      64            4            0.543       0.543
Word4      189           4            0.533       0.533
Sentence   -             -            0.245       0.248

LED needs 16-47× more epochs to reach its best accuracy (64 vs 4 on word3, 189 vs 4 on word4), but asymptotically it is EQUAL to BP. This was the empirical confirmation of the Lillicrap 2016 paper.
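
For reference, a common form of cosine learning-rate decay looks like the sketch below. The post only says "cosine lr decay", so the details are assumptions: a floor `lr_min` of 0, no warmup, and stepping once per epoch. `lr_max=5e-3` and 200 epochs are the word3 settings from the configuration table.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at epoch 0 to lr_min at the final epoch."""
    progress = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(ep, 200, lr_max=5e-3) for ep in range(200)]
print(schedule[0], schedule[100], schedule[-1])
```

The rate stays near `lr_max` early on, falls fastest in the middle of training, and approaches `lr_min` smoothly at the end, giving a slowly learning rule many small late steps.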

Stage 4 — all three together: Sparse + LIF + LED

This is where the unexpected thing happened. Three biological elements were combined:

  1. Sparse 80/20 topology
  2. Truly binary LIF spike (NO surrogate gradient)
  3. LED local learning (no backprop)

With 5 random seeds:

Task    Sparse+LIF+LED   BP      Gain     Active params
Word3   0.585 ± 0.034    0.543   +0.042   10K
Word4   0.563 ± 0.058    0.533   +0.030   10K

It SURPASSED BP. And with 3× fewer parameters.

Spike rate: 7.4% — at the level of real biological cortex (1-10%).

Why did it work? — hypotheses

The 3 elements alone are equal to or below BP:

  • Sparse alone: equal to dense (0.543)
  • LIF + BP: 0.596 — good, but with BP’s help
  • LED alone: equal to BP (with long training, 0.543)

But all three together outperform BP. My hypotheses:

Implicit regularization

  1. Sparse mask: fewer parameters, so overfitting is constrained
  2. Binary spike: implicit information bottleneck (only 0/1 information passes through)
  3. DFA random feedback: noise in the gradient, which encourages exploration
  4. Cosine lr decay: small steps late in training, during fine-tuning

Together these elements provide an inductive bias. BP, on the other hand, is “fully unconstrained”: all parameters optimize quickly and hit a plateau (word3 stopped at ep 4!). With LED, the implicit regularization leads to better generalization.

The mathematical basis of DFA

Lillicrap (2016)'s finding: even with fixed random feedback B, the network still learns, because during training the forward weights W come into alignment with B (the alignment property):

During training: W^T · sign(e) ≈ B · sign(e)

The network adapts to its own feedback. This is emergent behavior.
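
One way to check this alignment is to measure the cosine between the signal backprop would send, W^T e, and the signal the random feedback actually sends, B e. The sketch below is a hypothetical diagnostic in plain Python and is not part of the project code; the names and sizes are assumptions. A positive cosine means the random-feedback update points within 90° of the true gradient direction.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(W, B, e_out):
    """Cosine between the backprop signal W^T e and the feedback signal B e."""
    W_T = [list(col) for col in zip(*W)]
    return cosine(matvec(W_T, e_out), matvec(B, e_out))

rng = random.Random(0)
n_hid, n_out = 8, 3
B = [[rng.gauss(0, 1) for _ in range(n_out)] for _ in range(n_hid)]
e_out = [0.3, -0.7, 0.4]

W_random = [[rng.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_out)]
W_aligned = [list(col) for col in zip(*B)]   # extreme case: W equals B^T

print(alignment(W_random, B, e_out))    # untrained weights: no systematic alignment
print(alignment(W_aligned, B, e_out))   # perfectly aligned weights
```

Logging this value per layer during training would show whether the forward weights are drifting toward the feedback matrix.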

Cross-entropy returns local credit

(softmax - onehot) — the most beautiful local signal. For each output neuron:

  • if its probability is above the true target: positive → decrease
  • if below: negative → increase

This immediately turns into Δw_head = e · pre_head. With no backward chain at all.
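
As a small worked example of this signal, the sketch below computes softmax - onehot for three output neurons and turns it into the head update as an outer product with the presynaptic activity. The logits, the target index, and the binary `pre_head` vector are made-up illustrative values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def output_error(logits, target):
    """Return softmax(logits) - one_hot(target), one value per output neuron."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def head_update(e_out, pre_head):
    """Outer product: dW[k][j] = e_out[k] * pre_head[j]."""
    return [[e * h for h in pre_head] for e in e_out]

e_out = output_error([2.0, 0.5, -1.0], target=1)
dW = head_update(e_out, pre_head=[1.0, 0.0, 1.0, 1.0])
print(e_out)
print(dW)
```

The target neuron gets a negative error (its weights go up after `W -= lr * dW`), the others get a positive error, and any input that did not spike contributes nothing to the update.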

The pseudo-derivative is close to e-prop

Bellec (2020) e-prop:

ΔW = e_global · eligibility_trace · pseudo_derivative

LED:

ΔW = e_local_DFA · pre · pseudo_derivative

In LED, instead of an eligibility_trace, we have pre · pseudo_derivative — this is the single-timestep version. The core idea is the same: the spike is non-differentiable, but the pseudo-derivative estimates local credit.

An honest caveat — it fell over on the sentence task

I won on word3/word4, but then came the real test: 12 templates, 2400 pairs, 120-epoch sentence task:

Method           seen            unseen          spike rate
Sparse+LIF+LED   0.257 ± 0.003   0.224 ± 0.003   7.9%
Sparse+LIF+BP    0.739 ± 0.002   0.412 ± 0.008   38%

BP is 2.9× higher on seen, 1.8× on unseen. LED hit a plateau at ep 25 (0.23); BP kept improving through ep 100.

What this means

Word3/word4 (26-28 words) was a small-task ceiling artifact. Both methods reach the ceiling, and LED came out slightly higher due to random variation.

On a hard structured sequence task (multi-template, contextual), the compositional advantage of the chain rule comes back. DFA-style local learning did not scale to deep contextual tasks here. The same pattern is commonly reported for random-feedback methods (feedback alignment in Lillicrap 2016, DFA in Nøkland 2016): they do well in shallow networks and degrade in deep ones.

Honest reassessment:

Task                    LED     BP      Winner
Word3 (28 words)        0.585   0.543   LED +0.04 (small-scale ceiling)
Word4 (26 words)        0.563   0.533   LED +0.03 (small-scale ceiling)
Sentence (2400 pairs)   0.257   0.739   BP +0.482

Later (v36) I added TimeMix — the gap widened (3.0×). In v37, even 5 interventions (width, density, skip connection, lateral, layered) did not help. The LED direction was closed.

What is right and what is not

In this project I tested 3 intuitions. Honest verdicts:

Idea                               Status                  Note
Local rule + DFA (LED)             worked                  Equal to BP under long training, higher on small tasks
Lateral diffusion (to neighbors)   irrelevant              α=0 and α=0.2 are equal
Sparse connectivity                worked                  Equal to dense, 2.4× smaller
Exact 80/20 ratio                  unreliable              Random sparse is the same
Purely local (lf=1.0)              bad                     Distant connections are needed
All three together                 worked on small tasks   Not enough on sentence

The main philosophical lesson: bio learning is not one magic rule, but rather a combination of several correctly chosen elements. I searched for that combination across 33 versions. I found it in v35 — but with a scale limit.

Technical details — reproducibility

Configuration

Parameter        Word3   Word4
n_embd           64      64
hidden           128     128
n_layers         2       2
max_len          8       8
n_repeats        30      30
batch            32      32
lr_max           5e-3    3e-3
epochs           200     250
density          0.15    0.15
local_frac       0.8     0.8
lif_tau          2.0     2.0
lif_threshold    1.0     1.0
feedback_scale   0.1     0.1

Hardware

A Mac M-series CPU is enough (MPS is not necessary on these tasks). 200 epochs ~30-60 seconds.

Wall-clock

Variant          Per-epoch   Total (200 ep)
BP               ~0.05s      10s
LED              ~0.1s       20s
Sparse+LIF+LED   ~0.1-0.3s   30-60s

Replication command

# Word3 — Sparse+LIF+LED, 5 seed
for seed in 0 1 2 3 4; do
    python sparse_lif_led.py --task word3 --epochs 200 \
        --hidden 128 --lr 5e-3 --seed $seed \
        --out runs/slL_w3_s$seed.json
done

# Word4 — 5 seed
for seed in 0 1 2 3 4; do
    python sparse_lif_led.py --task word4 --epochs 250 \
        --hidden 128 --lr 3e-3 --seed $seed \
        --out runs/slL_w4_s$seed.json
done

Related work

Work                                          Year   Relevance
Hebb, Organization of Behavior                1949   Local Hebbian foundation
Widrow & Hoff, Adaptive switching circuits    1960   Delta rule
Rumelhart et al., Backprop                    1986   LED rejects this
Lillicrap et al., Random feedback alignment   2016   DFA inspiration
Nøkland, Direct feedback alignment            2016   LED’s core
Bellec et al., e-prop                         2020   LIF pseudo-grad
Hinton, Forward-Forward                       2022   Another bio alternative
Zhu et al., SpikeGPT                          2023   LIF architecture

LED’s contribution: the combination of DFA + sparse 80/20 + binary LIF + cosine lr (to my knowledge, this combination has not been tried before).

Open questions

  1. Does LED scale? At hidden=512+, n_layers=6+, does it stay equal to BP? Answer (v36-v37): no, on sentence BP is 3× higher.
  2. Does Sparse+LIF+LED work on the sentence task (where TimeMix is needed)? Answer: no, accuracy plateaus at 0.23.
  3. How does LED + frozen slot work in continual learning? Untested.
  4. Does 2D topology + weight-space lateral save lateral diffusion? Open.
  5. How does online training (batch=1) work? Open.

Conclusion

LED is a logically simple, biologically plausible learning algorithm that works at small scale. It is:

  • Backprop-free: no chain rule
  • Purely local: each neuron only knows its own information
  • Sparse-friendly: unconnected neurons do not learn
  • Spike-friendly: truly binary LIF (no surrogate needed)
  • Empirically validated: at small scale, +0.030 to +0.042 above BP
  • Scale-limited: on deep compositional tasks, BP wins back the advantage

Main contribution: the combination of DFA + sparse 80/20 + truly binary LIF + cosine lr, and the inductive bias / regularization effect of that combination.

Philosophical conclusion: backpropagation is mathematically perfect, but if you care about biological plausibility and a useful inductive bias, then at small scale LED is both a logical and a practical alternative. But I cannot claim that a “post-BP era” has begun: on deep sequence tasks the chain rule’s advantage returns. This is a genuine research result, success and limitation together.

The most important honest part

I came into this project at v34 with a record 0.774 sentence accuracy. In v35 I beat BP (on small tasks). In v36-v37 I could not scale it. The road closed.

This is success and honest defeat together. Research is like that: an idea works, but it has a limit. And knowing that limit precisely is far more valuable than claiming things without knowing.

If you also want to work on bio-plausible learning, my advice is: try the combination, not the parts alone. And test on a genuinely large task — small-scale wins can be deceptive.


An honest assessment running through this post: 80% of the idea worked, 20% (lateral diffusion) was not confirmed. The limit on the sentence task became clear. An honest result is worth more than an honest claim.
