<p dir="ltr">Modern deep neural networks excel at vision and language tasks, yet we still lack a principled account of <i>why </i>they exhibit generalizing cognitive abilities, and <i>how </i>they implement such abilities. This thesis advances our understanding from a <i>mechanistic </i>perspective. We restrict attention to a modest set of concrete, well-defined problem settings that are analytically tractable yet scientifically meaningful.</p><p dir="ltr"><b>Generalization (Chapters 2 & 3)</b>. We mathematically analyze how properties of the data distribution and training pipeline shape the features learned by neural networks, and the consequences for model generalization.</p><p dir="ltr">(i) We develop theory that explains a common practice in computer vision: pre-training models with <i>fine-grained</i> labels. By proposing a <i>hierarchical multi-view</i> data model and analyzing the learning dynamics induced by gradient descent, we show that finer labels tend to yield richer representations and, in turn, lower downstream error. (ii) In parallel, we analyze <i>feature-based student-teacher learning</i>, proving that early stopping prevents the student's features from <i>collapsing</i> onto the teacher's, and that teachers with lower-complexity features transfer knowledge more effectively; both strategies yield student networks that generalize better.</p><p dir="ltr"><b>Interpretability (Chapters 4 & 5). </b>Towards uncovering how large language models (LLMs) reason, we construct two synthetic testbeds: <i>propositional-logic</i> proof generation, and <i>multi-hop in-context</i> question answering. Each task is designed to preclude superficial pattern-matching while remaining amenable to precise causal interventions.
(i) For the logic problems, across multiple LLMs (up to 27B parameters), we localize sparse sets of attention heads – specializing in rule retrieval, fact routing, and decision making – that are <i>jointly necessary and sufficient</i> for correct proofs. The similarity of these circuits across models provides early evidence of their universality. (ii) On the multi-hop question-answering task, we find a favorable scaling trend: a well-performing 27B-parameter Gemma model tends to disentangle and compose latent concepts step by step, whereas its 2B-parameter counterpart (which achieves much lower accuracy) relies on noisy, entangled pathways.</p><p dir="ltr">Collectively, these results indicate that strong performance in modern deep neural networks can stem from identifiable, interpretable mechanisms rather than opaque heuristics. They pave the way for practical guidelines for choosing training regimes that generalize, and for locating the specific model components that implement abstract cognitive abilities.</p>