Gradient Sparse Autoencoders

11/15/2024


Introduces an approach to extracting neural network features that considers both activation values and their downstream effects.

Traditional sparse autoencoders focus only on activation magnitudes. Our approach incorporates gradient information to identify features that are not just active, but actually influential in the network's computations.
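
To make the contrast concrete, here is a minimal sketch of one way gradient information could change which features are kept. It is not the paper's implementation. It assumes a top-k sparsity rule and an attribution score of |activation × gradient|. The function names (`top_k_sparse`, `activation_only`, `gradient_aware`) and the toy numbers are hypothetical, and the gradients are supplied by hand rather than computed by backpropagation.

```python
# Illustrative sketch only: the scoring rule and all names are assumptions,
# not the method described in the paper.

def top_k_sparse(latents, scores, k):
    """Keep the k latents with the highest scores and zero out the rest."""
    order = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(latents)]


def activation_only(latents, k):
    """Baseline: rank latents by activation magnitude alone."""
    return top_k_sparse(latents, [abs(v) for v in latents], k)


def gradient_aware(latents, grads, k):
    """Assumed variant: rank latents by |activation * downstream gradient|."""
    scores = [abs(v * g) for v, g in zip(latents, grads)]
    return top_k_sparse(latents, scores, k)


# Toy values: latent 0 is strongly active but has almost no downstream effect,
# while latent 1 is weakly active but has a large downstream gradient.
latents = [3.0, 0.5, 2.0, 1.0]
grads = [0.01, 4.0, 0.1, 1.5]

by_activation = activation_only(latents, k=2)
by_gradient = gradient_aware(latents, grads, k=2)
```

With these toy values, the magnitude-only rule keeps latents 0 and 2, while the gradient-weighted rule keeps latents 1 and 3: it favors features that are influential over features that are merely active.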

[Figure: Gradient analysis visualization]
