Gradient Sparse Autoencoders

This work introduces an approach to extracting neural network features that considers both their activation values and their downstream effects.

Traditional sparse autoencoders focus only on activation magnitudes. Our approach incorporates gradient information to identify features that are not just active, but actually influential in the network's computations.
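The distinction between an active feature and an influential one can be illustrated with a small first-order attribution sketch. This is not the paper's method, only one plausible reading of "incorporating gradient information": each feature's activation is multiplied by the dot product of its decoder direction with the gradient of some downstream quantity. The names `feature_influence`, `decoder_rows`, and `downstream_grad` are hypothetical, and the numbers are made up for illustration.

```python
def feature_influence(activations, decoder_rows, downstream_grad):
    """First-order estimate of each feature's downstream effect.

    activations     : one activation value per feature
    decoder_rows    : one decoder direction per feature (list of floats)
    downstream_grad : gradient of a downstream quantity with respect to
                      the reconstructed activation vector (assumed given)
    """
    scores = []
    for a, direction in zip(activations, decoder_rows):
        # How much the downstream quantity changes per unit of this feature.
        sensitivity = sum(g * w for g, w in zip(downstream_grad, direction))
        # Activation times sensitivity: zero if the feature is inactive
        # or if its direction has no first-order downstream effect.
        scores.append(a * sensitivity)
    return scores


# Toy example: feature 0 is strongly active but points in a direction the
# downstream computation ignores; feature 1 is weakly active but aligned
# with the gradient.
activations = [2.0, 0.5]
decoder_rows = [[1.0, 0.0], [0.0, 1.0]]
downstream_grad = [0.0, 3.0]

scores = feature_influence(activations, decoder_rows, downstream_grad)
print(scores)  # [0.0, 1.5]
```

Ranking by activation magnitude alone would put feature 0 first. Under this first-order score, feature 1 is the only one with any estimated downstream effect.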

[Figure: gradient analysis visualization]
