November 12, 2025
Misraj AI
AI
This article aims to calculate the memory requirements of a DNN. I’ve written it as a simpler companion to the previous article on LLM memory usage. You can follow it knowing only the basics of linear algebra and the basics of PyTorch.
We will compute the memory usage of a simple neural network to make the later computation for LLMs easier to follow.
We will build the network with the PyTorch framework and compute the memory of this simple model. We are using PyTorch because most readers are familiar with it, but no matter which framework you use, the computation is the same.
import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.gat = nn.Linear(2048, 4096)
        self.up = nn.Linear(4096, 4096)
        self.down = nn.Linear(4096, 2048)
        self.out = nn.Linear(2048, 10)

    def forward(self, x):
        gat_proj = nn.functional.relu(self.gat(x))
        up_proj = nn.functional.relu(self.up(gat_proj))
        gat_proj = nn.functional.relu(gat_proj + up_proj)
        gat_proj = nn.functional.relu(self.down(gat_proj))
        gat_proj = gat_proj + x
        out = nn.functional.softmax(self.out(gat_proj), dim=-1)
        return out
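As a quick sanity check, here is a minimal sketch (assuming it runs in the same script as the NeuralNetwork class above) that instantiates the model and passes a random input through it:

model = NeuralNetwork()
x = torch.randn(1, 2048)   # one sample with 2048 features
out = model(x)
print(out.shape)           # torch.Size([1, 10])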
This code defines a neural network using PyTorch’s torch.nn module. The network is structured in a class called NeuralNetwork, which inherits from nn.Module, the base class for all neural network modules in PyTorch. Let's break down the components and functionality of this code.
Layers Definition:
self.gat = nn.Linear(2048, 4096): Defines a fully connected layer (linear transformation) that maps input features of size 2048 to output features of size 4096.
self.up = nn.Linear(4096, 4096): Defines another fully connected layer that maintains the feature size at 4096.
self.down = nn.Linear(4096, 2048): Defines a fully connected layer that reduces the feature size back to 2048.
self.out = nn.Linear(2048, 10): Defines a final fully connected layer that maps the 2048 features to 10 output classes (e.g., for a classification problem with 10 classes).
These layers hold the learnable parameters of the model and make up the model_memory, which is all we need at inference time.
Let’s break down the NN memory into 4 parts: parameters (P), gradients (G), optimizer states (O), and activations.
Memory usage = Number of elements × Size of each element
The size of each element depends on the data type: 1 byte, 2 bytes, or 4 bytes. We will use 4 bytes (float32) for all computations.
So we need to get the number of elements within this model to compute the memory.
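To make the formula concrete, here is a small sketch showing how PyTorch reports element sizes and how the formula applies to a single tensor (the 2048x4096 shape below is just an illustrative example):

import torch

# element size in bytes for common data types
print(torch.tensor(0.0, dtype=torch.float32).element_size())  # 4
print(torch.tensor(0.0, dtype=torch.float16).element_size())  # 2
print(torch.tensor(0, dtype=torch.int8).element_size())       # 1

# memory of a tensor = number of elements * size of each element
t = torch.randn(2048, 4096)           # float32 by default
print(t.numel() * t.element_size())   # 33554432 bytes ~= 32 MB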
For the first three (P, G, O) it is quite easy to compute the memory. First of all, let’s compute P.
From the layers definition, we can see that we have 4 weight matrices, [gat, up, down, out]. To compute the parameters of this model we only need to count the number of items in these 4 matrices (we neglect the small bias vectors here).
P = num_parameters * 4 bytes
num_parameters = gat + up + down + out
num_parameters = (2048*4096) + (4096*4096) + (4096*2048) + (2048 * 10)
num_parameters = 33574912
P = 33574912 * 4
P = 134299648 bytes
P ~= 128 MB
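You can verify the hand count directly in PyTorch. This is a minimal sketch, assuming the NeuralNetwork class from above; note that model.parameters() also includes the bias vectors (about 10K extra values), which we neglected in the hand count:

model = NeuralNetwork()
num_parameters = sum(p.numel() for p in model.parameters())
print(num_parameters)                       # 33585162 (weights + biases)
print(num_parameters * 4 / 1024**2, "MB")   # ~128 MB in float32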
The gradient memory G is the same as the parameter memory P, since each parameter gets exactly one gradient value,
G = P = 134299648 bytes
G ~= 128 MB
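If you want to convince yourself of this, here is a small sketch (again assuming the NeuralNetwork class above): after loss.backward(), every parameter has a .grad tensor of exactly the same shape, so the gradient memory matches the parameter memory.

model = NeuralNetwork()
x = torch.randn(4, 2048)
loss = model(x).sum()   # any scalar loss is enough for this check
loss.backward()

param_elems = sum(p.numel() for p in model.parameters())
grad_elems = sum(p.grad.numel() for p in model.parameters())
print(grad_elems == param_elems)   # True: one gradient value per parameter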
For the optimizer O, it depends on the optimizer itself. For example, if you compare SGD and AdamW, you will notice that AdamW stores two extra tensors per parameter (the first and second moment estimates), while SGD with momentum stores only one (the momentum buffer). We will use AdamW for our computation.
O = P * 2
O = 256 MB
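Here is a minimal sketch (still assuming the NeuralNetwork class above) that shows where the factor of 2 comes from: after the first optimizer step, AdamW keeps two state tensors per parameter, exp_avg and exp_avg_sq.

from torch.optim import AdamW

model = NeuralNetwork()
opt = AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 2048)).sum()
loss.backward()
opt.step()   # optimizer states are allocated on the first step

# two extra tensors per parameter: first and second moment estimates
state_elems = sum(
    s["exp_avg"].numel() + s["exp_avg_sq"].numel() for s in opt.state.values()
)
param_elems = sum(p.numel() for p in model.parameters())
print(state_elems == 2 * param_elems)   # True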
Now that we’ve done the easy part, let’s compute the activation memory. The activation memory is the major memory proportion at training time for the Transformer model structure, but for this toy example it will be very small.
Let’s explain what is going on with the activations, and then we will compute the memory directly. The activation memory consists of the intermediate tensors we create during the forward pass; these tensors are saved so they can be reused to compute the gradients during backpropagation.
Forward Method
This method defines the forward pass of the network, specifying how the input tensor x flows through the network layers and activation functions.
First Linear Layer and Activation:
gat_proj = nn.functional.relu(self.gat(x)): The input tensor x is passed through the gat layer and a ReLU activation function is applied. This transforms the input features from 2048 to 4096 dimensions and applies non-linearity. gat_proj is a vector with 4096 elements (float32).
Second Linear Layer and Activation:
up_proj = nn.functional.relu(self.up(gat_proj)): The transformed features are passed through the up layer and another ReLU activation function is applied. This maintains the feature size at 4096 dimensions with additional non-linearity. up_proj is a vector with 4096 elements (float32).
Residual Connection and Activation:
gat_proj = nn.functional.relu(gat_proj + up_proj): A residual connection adds the gat_proj and up_proj outputs element-wise, followed by a ReLU activation. This step combines the original and transformed features, adding a shortcut connection to improve gradient flow and model performance. This adds another 4096 elements (float32).
Third Linear Layer and Activation:
gat_proj = nn.functional.relu(self.down(gat_proj)): The combined features are passed through the down layer, which reduces the feature size back to 2048 dimensions, followed by a ReLU activation. This adds 2048 elements (float32).
Second Residual Connection:
gat_proj = gat_proj + x: Another residual connection adds the input tensor x to the transformed features gat_proj element-wise. This step further combines the original and transformed features. This also adds 2048 elements (float32).
Final Output Layer:
out = nn.functional.softmax(self.out(gat_proj), dim=-1): The resulting features are passed through the out layer and a softmax function to produce the final output probabilities for 10 classes. This adds the last 10 elements (float32).
Now we know how many elements each step adds, but we usually train with mini-batches, so to make the computation more general we will multiply by the batch size,
activation_memory = batch_size * 4096 +
batch_size * 4096 +
batch_size * 4096 +
batch_size * 2048 +
batch_size * 2048 +
batch_size * 10
activation_memory = batch_size * ( 3 * 4096 + 2 * 2048 + 10 )
# Batch_size 128
activation_memory = 128 * 16394 = 2098432
activation_memory = 2098432 * 4 bytes
activation_memory ~= 8 MB
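The same count can be written as a small helper function. This is just a sketch of the formula above for this specific toy network, not a general-purpose tool:

def activation_memory_bytes(batch_size, bytes_per_element=4):
    # elements saved per sample, following the step-by-step count above
    elements_per_sample = 3 * 4096 + 2 * 2048 + 10   # 16394
    return batch_size * elements_per_sample * bytes_per_element

print(activation_memory_bytes(128) / 1024**2, "MB")   # ~8 MB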
With this, we now know all the parts we need to estimate the memory to train an NN,
total_memory ~= model_memory + optimizer_memory + gradient_memory + activation_memory
total_memory ~= (128 + 128 + 256 + 8) MB
total_memory ~= 520 MB
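Putting everything together, here is a sketch of the whole estimate for this toy network (the function name and arguments are just illustrative):

def estimate_training_memory_mb(num_parameters, batch_size,
                                bytes_per_element=4, optimizer_factor=2):
    model_memory = num_parameters * bytes_per_element      # P
    gradient_memory = model_memory                         # G
    optimizer_memory = model_memory * optimizer_factor     # O (AdamW: 2 states)
    activation_memory = batch_size * (3 * 4096 + 2 * 2048 + 10) * bytes_per_element
    total = model_memory + gradient_memory + optimizer_memory + activation_memory
    return total / 1024**2

print(estimate_training_memory_mb(33574912, 128))   # ~520 MB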
In conclusion, we learned how to compute the memory requirements of an NN during the training phase. I hope this article is simpler than the LLM memory usage article and easier to read and comprehend. I will be pleased to answer your questions and discuss any feedback, so please don’t hesitate to drop a comment, either for discussion or if you notice any error in the article.
Khalil Hennara
AI Engineer at MISRAJ