Tutorial · September 2025 · 7 min read

Getting Started with PyTorch for Time-Series Deep Learning

A practical guide to building 1D convolutional networks for time-series classification in PyTorch, with lessons learned from EEG signal processing. The details that textbooks skip.

Time-series classification with deep learning doesn't always get the attention it deserves compared to image tasks. But the core principles carry over remarkably well — you just need to adapt your mental model from 2D spatial reasoning to 1D temporal reasoning.

This guide distills what I learned while building EEG classifiers for my research. I'll focus on the practical details that trip people up, not just the architecture diagrams.

The Core Idea: 1D Convolutions

In image CNNs, Conv2d kernels slide over height and width dimensions, learning spatial patterns. For time-series, Conv1d kernels slide along the temporal dimension, learning temporal patterns — oscillations, edges, trends.

import torch
import torch.nn as nn

class TimeSeriesClassifier(nn.Module):
    def __init__(self, n_channels, n_classes, seq_len):
        super().__init__()
        # seq_len is not actually needed: global average pooling (below) makes
        # the model independent of the input length; it is kept for interface clarity.

        self.features = nn.Sequential(
            # Block 1: capture fast local patterns
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.BatchNorm1d(32),
            nn.GELU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.25),

            # Block 2: capture slower patterns
            nn.Conv1d(32, 64, kernel_size=15, padding=7),
            nn.BatchNorm1d(64),
            nn.GELU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.25),

            # Block 3: high-level temporal structure
            nn.Conv1d(64, 128, kernel_size=31, padding=15),
            nn.BatchNorm1d(128),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(1),  # Global average pooling
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.GELU(),
            nn.Dropout(0.5),
            nn.Linear(64, n_classes)
        )

    def forward(self, x):
        # x shape: (batch, channels, time)
        return self.classifier(self.features(x))
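
A quick sanity check with a dummy batch catches shape mistakes before training; the batch size and length below are arbitrary examples:

model = TimeSeriesClassifier(n_channels=8, n_classes=4, seq_len=256)
x = torch.randn(16, 8, 256)   # dummy batch: (batch, channels, time)
print(model(x).shape)         # torch.Size([16, 4]): one logit per class
print(sum(p.numel() for p in model.parameters()), "parameters")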

Lesson 1: Normalization Matters Enormously

This is the most important lesson from working with EEG data. Raw EEG signals have wildly different amplitudes across subjects, sessions, and even channels. If you feed raw data to your network, the model will learn amplitude-dependent features that don't generalize.

Z-score normalize per trial, per channel:

def normalize(x):
    # x: (channels, time)
    mean = x.mean(dim=-1, keepdim=True)
    std  = x.std(dim=-1, keepdim=True) + 1e-8
    return (x - mean) / std

Do this as a preprocessing step, not inside the network. You want your network's first layer to receive consistently scaled input regardless of the subject.
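
If all trials fit in memory, this can be a one-time step before building any DataLoader. A minimal sketch, assuming trials is a float tensor of shape (n_trials, channels, time):

# One-time preprocessing: normalize() broadcasts over a stacked trial tensor,
# since mean and std are computed along the time dimension only.
trials = normalize(trials)   # "trials" is an assumed (n_trials, channels, time) tensor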

Lesson 2: Kernel Size Is a Design Choice, Not a Hyperparameter

Kernel size determines what temporal scale of patterns your convolutions can detect. For EEG at 256 Hz sampling rate:

  • Small kernels (3–9 samples) — ~10–35ms. Capture sharp transient events, spike-like features.
  • Medium kernels (15–31 samples) — ~60–120ms. Capture alpha/beta band oscillatory patterns.
  • Large kernels (63–127 samples) — ~250–500ms. Capture slow drift, low-frequency components.

Multi-scale architectures often win because they process the signal at multiple temporal resolutions simultaneously. Inception-style blocks with parallel convolutions of different kernel sizes are worth trying.
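
A minimal sketch of such a block, separate from the model above; the branch width and kernel sizes are illustrative choices:

class MultiScaleBlock(nn.Module):
    """Inception-style block: parallel Conv1d branches at different temporal scales."""
    def __init__(self, in_ch, branch_ch=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (7, 15, 31)   # fast, medium, slow temporal patterns
        ])
        self.bn  = nn.BatchNorm1d(3 * branch_ch)
        self.act = nn.GELU()

    def forward(self, x):
        # Every branch sees the same input; outputs are concatenated channel-wise
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.bn(out))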

Lesson 3: Don't Underestimate Data Augmentation

When you have 20 subjects × 100 trials × 4 classes, you don't have a lot of data. Augmentation is critical, but you need augmentations that preserve class-relevant signal characteristics while varying nuisance factors.

def augment(x, label):
    # x: (channels, time), already z-score normalized
    x = x.clone()  # copy first so channel dropout below doesn't mutate the stored trial

    # 1. Additive Gaussian noise
    if torch.rand(1) > 0.5:
        x = x + torch.randn_like(x) * 0.05

    # 2. Random temporal shift (circular)
    if torch.rand(1) > 0.5:
        shift = torch.randint(-50, 50, (1,)).item()
        x = torch.roll(x, shift, dims=-1)

    # 3. Channel dropout
    if torch.rand(1) > 0.5:
        n_drop = torch.randint(1, 3, (1,)).item()
        ch_idx = torch.randperm(x.shape[0])[:n_drop]
        x[ch_idx] = 0

    return x, label
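
Augmentation belongs in the data pipeline, applied on the fly and only to training data. A minimal Dataset sketch, assuming trials and labels tensors that were already normalized offline; the class name and the train flag are illustrative, and the resulting train_loader is what the training loop below consumes:

from torch.utils.data import Dataset, DataLoader

class EEGDataset(Dataset):
    """Illustrative dataset: trials are pre-normalized, augmentation is train-only."""
    def __init__(self, trials, labels, train=True):
        self.trials, self.labels, self.train = trials, labels, train

    def __len__(self):
        return len(self.trials)

    def __getitem__(self, i):
        x, y = self.trials[i], self.labels[i]
        if self.train:
            x, y = augment(x, y)
        return x, y

train_loader = DataLoader(EEGDataset(trials, labels, train=True),
                          batch_size=32, shuffle=True)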

Lesson 4: Global Average Pooling Over Flattening

When your input length varies or you want to reduce overfitting in the classification head, AdaptiveAvgPool1d(1) (global average pooling) beats flattening the full feature map. It collapses each feature channel to a single value, so the head always receives a fixed-size vector regardless of sequence length, and it dramatically reduces the parameter count.
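
One practical consequence, checked below with arbitrary batch sizes and lengths: the same weights accept trials of different durations without any change to the head.

m = TimeSeriesClassifier(n_channels=8, n_classes=4, seq_len=256)
print(m(torch.randn(4, 8, 256)).shape)    # torch.Size([4, 4])
print(m(torch.randn(4, 8, 1024)).shape)   # torch.Size([4, 4]), longer trials, same head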

Setting Up Training

model     = TimeSeriesClassifier(n_channels=8, n_classes=4, seq_len=256)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()

# Training loop
n_epochs = 100  # match the cosine schedule's T_max above
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()

Note the gradient clipping (clip_grad_norm_) — essential for time-series models where gradient norms can explode, especially with longer sequences.
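
A matching evaluation pass, sketched under the assumption of a val_loader built like train_loader but with augmentation disabled:

model.eval()   # disables dropout and freezes batch-norm statistics
correct = total = 0
with torch.no_grad():
    for X_batch, y_batch in val_loader:
        preds = model(X_batch).argmax(dim=1)
        correct += (preds == y_batch).sum().item()
        total += y_batch.numel()
print(f"val accuracy: {correct / total:.3f}")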


This guide reflects what I learned building classifiers for my UCICS 2026 paper on subject-independent EEG event recognition. The architecture described above is a simplified version of the final model.
