<p dir="ltr">In-memory computing (IMC) has emerged as a promising paradigm to overcome the performance and energy bottlenecks of traditional von Neumann architectures. By performing matrix-vector multiplications (MVMs) within memory arrays, IMC enables massively parallel computation while significantly reducing costly data movement. Despite these advantages, deploying IMC for modern machine learning models remains challenging due to the high overhead of analog-to-digital converters (ADCs), limited on-chip area, and the demand for both linear operations and transcendental functions in transformer-based models.</p><p dir="ltr"><br></p><p dir="ltr">This thesis presents four contributions that address these challenges from the architecture level to the device level. First, SAMBA introduces a sparsity-aware IMC architecture that balances non-uniform sparsity across crossbars and minimizes inter-core communication, thereby amplifying the benefits of reconfigurable ADCs. To scale beyond limited chip budgets, the second contribution develops an IMC system with off-chip memory that integrates area-constrained workload partitioning, optimized weight reloading, and selective replication, ensuring efficient execution of large neural networks that cannot fit fully on-chip. Third, HASTILY advances IMC into the domain of transformer inference, where attention modules and softmax operations impose quadratic memory and computational complexity. Leveraging SRAM-based Unified Compute-and-Lookup Modules (UCLMs), HASTILY embeds both MVM operations and lookup-table references for softmax into the same memory arrays, reducing area overhead while improving throughput and energy efficiency. Finally, building upon this idea, MemRaptor integrates ROM functionality into STT-MRAM crossbars, enabling both RAM and ROM to coexist within a single array. This dual-functionality allows MVMs and transcendental functions such as sigmoid, tanh, and softmax to be executed in the MRAM-based memory, supporting recurrent and transformer-based NLP workloads including LSTM, BERT, and GPT.</p><p><br></p><p dir="ltr">Together, these works form a coherent design framework for IMC: from architectural techniques for sparsity, to system-level scalability with off-chip memory, to hardware-software co-design for transformers, and ultimately to device-level functional enrichment. Evaluations across convolutional, recurrent, and transformer models consistently demonstrate significant improvements in throughput and energy efficiency. Overall, this thesis establishes that the success of IMC as a next-generation AI accelerator depends on cross-layer co-optimization across devices, circuits, and architectures.</p><p dir="ltr"><br></p>