MS Lesion Segmentation using Supervised CNN-Based Methods

Apr 26, 2021

Introduction

Multiple sclerosis (MS) is a chronic disease that damages the nerves in the brain and spinal cord. "Sclerosis" means scarring, and people with MS develop multiple areas of scarred (lesion) tissue. Depending on where the damage occurs, symptoms may include problems with muscle control, balance, vision or speech.

To diagnose MS, MRI is used alongside lab tests to identify lesions in the tissues of the brain and nerves. With recent advances in deep learning, researchers have applied computer vision techniques to medical image analysis, including MS lesion segmentation.

This article is based on a survey paper covering CNN-based methods for MS lesion segmentation.

What is CNN?

A Convolutional Neural Network (ConvNet or CNN) is a deep learning model that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and learns to differentiate one from another. The connectivity pattern of a ConvNet is analogous to that of neurons in the human brain and was inspired by the organization of the visual cortex.
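To make the idea of learnable weights concrete, here is a minimal NumPy sketch of a single 2D convolution (strictly speaking a cross-correlation, which is what most deep learning libraries compute) followed by a ReLU. The function names and the hand-picked kernel are illustrative assumptions; in a real CNN the kernel values are learned from data.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

# Toy 5x5 "image" with a bright vertical stripe in the middle column.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hand-picked kernel (assumption); it responds where intensity rises left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
```

The first column of the feature map lights up because that is where the window sees the stripe on its right-hand side; the other positions give zero or negative responses, which the ReLU suppresses.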

Datasets used

MRI images are obtained from challenge datasets such as MICCAI 2008, ISBI 2015 and MSSEG 2016.

Data preprocessing

Data preprocessing usually involves skull stripping, bias field correction, rigid registration and intensity normalization.

Skull stripping: Compared with functional images, an MRI of the brain usually contains some non-brain tissues such as skin, fat, muscle, neck and eyeballs. The presence of these tissues is a major obstacle for automatic brain image segmentation and analysis techniques. Separating brain tissue from non-brain tissue is usually done with the “Brain Extraction Tool” (BET).

Bias field correction: The bias field is an undesirable, smoothly varying signal that corrupts images because of inhomogeneities in the magnetic field of the MRI machine, degrading the performance of segmentation algorithms. The N3 (Non-parametric Non-uniform Intensity Normalization) algorithm and its improved version, N4ITK, can be applied in this preprocessing step.

Rigid Image Registration: Rigid registration is used to align images across different MRI modalities acquired during a scanning session, across scans of a given patient acquired at different time points, and across different subjects of a study or to standard template spaces.

Intensity Normalization: During MR image acquisition, different scanners or parameters may be used for different subjects, or for the same subject at different times, which can result in large intensity variations. This variation greatly undermines segmentation performance. Various papers use a combination of two approaches: (i) histogram matching and (ii) normalizing the data to a specific range.
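The two normalization approaches can be sketched in a few lines of NumPy. This is a simplified illustration, not the procedure of any specific paper: the function names, the quantile-based matching and the synthetic "scans" are all assumptions made for the example.

```python
import numpy as np

def match_histogram(source, reference, n_quantiles=256):
    """Map source intensities so their quantiles follow the reference quantiles."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(source, q)
    ref_q = np.quantile(reference, q)
    # Piecewise-linear mapping from source quantiles to reference quantiles.
    matched = np.interp(source.ravel(), src_q, ref_q)
    return matched.reshape(source.shape)

def minmax_normalize(volume):
    """Rescale intensities to the range [0, 1]."""
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:
        return np.zeros_like(volume, dtype=float)
    return (volume - lo) / (hi - lo)

# Synthetic volumes standing in for two scans with different intensity scales.
rng = np.random.default_rng(0)
scan_a = rng.normal(100.0, 20.0, size=(8, 8, 8))
scan_b = rng.normal(300.0, 50.0, size=(8, 8, 8))

matched = match_histogram(scan_a, scan_b)   # scan_a now follows scan_b's histogram
normalized = minmax_normalize(matched)      # then squeezed into [0, 1]
```

In practice normalization is often restricted to voxels inside the brain mask, so that the background does not dominate the statistics.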

Data Representation

Since raw MRI images are large, input data is usually represented as patches in 2D, 3D or any format in between. This is an important decision and will affect the performance of neural networks.

The advantage of 2D (slice-based) patches is that the networks have fewer parameters to train and are therefore less prone to overfitting. However, contextual information along the third axis is lost. In contrast, 3D approaches exploit local contextual information in all three dimensions, but because 3D patches are small, they lack global context and cannot preserve long-distance spatial information. 3D networks are also computationally expensive, need more parameters, and are more prone to overfitting.

To balance 2D and 3D representations, multi-view data representation (three orthogonal 2D views passing through the same voxel are input to the network jointly) or 2.5D representation (extracting slices from different planes) has also been used to train the networks. Using 2.5D and prediction fusion, the networks are able to learn and utilize global information along all three axes.
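As a rough illustration of the multi-view idea, the sketch below cuts three orthogonal 2D patches through one voxel of a 3D volume. The function name, the (z, y, x) axis order and the axial/coronal/sagittal labels are assumptions; the actual mapping depends on how the scan is oriented.

```python
import numpy as np

def orthogonal_patches(volume, center, half_size):
    """Return three orthogonal 2D patches that pass through `center`.

    Assumes `center` lies at least `half_size` voxels away from every border.
    """
    z, y, x = center
    h = half_size
    axial = volume[z, y - h:y + h + 1, x - h:x + h + 1]
    coronal = volume[z - h:z + h + 1, y, x - h:x + h + 1]
    sagittal = volume[z - h:z + h + 1, y - h:y + h + 1, x]
    # Stack the three views as channels of a single network input.
    return np.stack([axial, coronal, sagittal], axis=0)

volume = np.arange(20 * 20 * 20, dtype=float).reshape(20, 20, 20)
views = orthogonal_patches(volume, center=(10, 10, 10), half_size=4)
print(views.shape)  # (3, 9, 9)
```

Each view is a cheap 2D patch, yet together they give the network context along all three axes around the same voxel.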

Data Preparation

Candidate Extraction: To extract input patches from raw data, a sliding window is moved voxel by voxel (a voxel is a small cube in 3D space) through the raw MRI volume. But this creates a lot of similar patches and causes class imbalance, since lesion voxels are far rarer than background voxels.

To combat this, for FCN (Fully Convolutional Network) methods, patches are extracted with a 50% probability of being centered on a background or a foreground voxel, or are centered on a lesion voxel with 99% probability (so most patches contain at least one lesion voxel). For non-FCN methods, any of the following four approaches can be followed. Approach 1: Apply a probabilistic WM (white matter) template and choose the patches whose center voxel has a high probability in the WM template. Approach 2: Randomly under-sample the negative class to make a balanced dataset. Approach 3: Randomly sample around the lesions. Approach 4: Augment data from the lesion class to balance the dataset and use circular non-uniform stratified sampling. Among voxels not labeled as lesions, a portion of the candidates is extracted from the neighborhood of lesions and the rest from the remaining voxels.

Augmentation: Augmentation is a technique that increases the diversity of the training set by applying random transformations such as image rotation. It usually includes random flips, random scaling, random rotation (in 2D, by an angle of n × 90°), 3D rotation, or adding random noise and a random bias field.

Label Processing: For training and testing models, the labels extracted from expert delineations (by expert human radiologists) can be either a scalar or a lesion map, but they invariably contain noise. Where datasets are delineated by more than one expert, the inter-rater variability can be addressed by simple voting, by STAPLE (Simultaneous Truth and Performance Level Estimation), by treating each delineation as a different sample, or by generating training labels by convolving the binary segmentations with a 3×3 Gaussian kernel to get softer boundaries.

Extra Information: Extra information such as spatial locations or intermediate results from other models can also be used as input to CNNs.
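The 50/50 sampling strategy for FCN methods might look like the following sketch. The function name and the toy lesion mask are hypothetical, and the code assumes the mask contains at least one background voxel.

```python
import numpy as np

def sample_patch_centers(lesion_mask, n_patches, lesion_prob=0.5, rng=None):
    """Pick patch centers so that roughly `lesion_prob` of them sit on a lesion voxel."""
    rng = np.random.default_rng() if rng is None else rng
    lesion_mask = np.asarray(lesion_mask, dtype=bool)
    lesion_voxels = np.argwhere(lesion_mask)
    background_voxels = np.argwhere(~lesion_mask)
    centers = []
    for _ in range(n_patches):
        # Fall back to background if the volume has no lesion voxels at all.
        use_lesion = len(lesion_voxels) > 0 and rng.random() < lesion_prob
        pool = lesion_voxels if use_lesion else background_voxels
        centers.append(pool[rng.integers(len(pool))])
    return np.array(centers)

# Toy volume: lesions cover only 8 of 4096 voxels (about 0.2%).
mask = np.zeros((16, 16, 16), dtype=bool)
mask[7:9, 7:9, 7:9] = True

centers = sample_patch_centers(mask, n_patches=1000, rng=np.random.default_rng(0))
lesion_fraction = mask[centers[:, 0], centers[:, 1], centers[:, 2]].mean()
print(round(float(lesion_fraction), 2))
```

Even though lesions make up a tiny share of the volume, about half of the sampled centers land on a lesion voxel. Setting `lesion_prob=0.99` gives the second variant described above.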
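A minimal sketch of a few of these transformations on a 2D slice is shown below. The function name, the flip probability and the noise level are arbitrary assumptions. The key point is that geometric transforms must be applied identically to the image and its label, while intensity noise is applied to the image only.

```python
import numpy as np

def augment_slice(image, label, rng):
    """Random flip, random n * 90 degree rotation, and additive noise."""
    if rng.random() < 0.5:
        image, label = np.flip(image, axis=1), np.flip(label, axis=1)
    k = int(rng.integers(0, 4))                 # rotate by k * 90 degrees
    image, label = np.rot90(image, k), np.rot90(label, k)
    # Noise changes intensities only, so the label is left untouched.
    image = image + rng.normal(0.0, 0.01, size=image.shape)
    return image, label

rng = np.random.default_rng(0)
label = np.zeros((32, 32), dtype=int)
label[5:10, 20:28] = 1                          # toy rectangular "lesion"
image = label.astype(float)                     # toy image: lesion is bright

aug_image, aug_label = augment_slice(image, label, rng)
```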
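Two of these options, simple voting and label softening, can be sketched as follows. The function names and the three toy rater masks are made up for illustration, and the 3×3 kernel is a common binomial approximation of a Gaussian rather than the exact kernel used in any particular paper.

```python
import numpy as np

def majority_vote(masks):
    """A voxel is lesion if more than half of the raters marked it."""
    masks = np.asarray(masks, dtype=bool)
    return masks.sum(axis=0) > (masks.shape[0] / 2)

def soften_labels(mask):
    """Smooth a binary 2D mask with a 3x3 Gaussian-like kernel."""
    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=float) / 16.0
    h, w = mask.shape
    padded = np.pad(mask.astype(float), 1, mode="edge")
    soft = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            soft += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return soft

# Three hypothetical raters who disagree slightly on the lesion boundary.
rater_a = np.zeros((6, 6), dtype=int); rater_a[2:4, 2:4] = 1
rater_b = np.zeros((6, 6), dtype=int); rater_b[2:4, 2:5] = 1
rater_c = np.zeros((6, 6), dtype=int); rater_c[3:5, 2:4] = 1

consensus = majority_vote([rater_a, rater_b, rater_c])
soft = soften_labels(consensus)   # values between 0 and 1 near the boundary
```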

Network Architecture

The network architecture plays a very important role in segmentation performance.

Network Backbones:

Early methods used voxel-wise classification to form lesion segmentation maps. A typical network consisted of a few convolution layers, followed by 2–3 fully connected layers (MLP). The input is an image patch, and networks are trained to classify whether the central voxel of the patch corresponds to a lesion. The disadvantages of these methods are a lack of spatial information, as only small patches are used, and computational inefficiency due to the repeated processing of similar patches.

Long et al. introduced Fully Convolutional Networks (FCN). An FCN uses a convolutional neural network to transform image pixels into pixel categories. Unlike the typical CNNs described above, FCNs do not make the lesion prediction by classifying the central voxel of each patch. Instead, they directly generate lesion maps of the same (or similar) size as the input images. However, due to the successive use of convolution and pooling layers, the segmentations produced are at a lower resolution. Adding skip connections helps preserve both the local information from low-level features and the contextual information from high-level features.

Ronneberger et al. introduced a U-shaped network called U-Net. It has an encoder-decoder structure and adds shortcut connections between corresponding layers of the two parts. The pooling operations of the encoder path are replaced by upsampling operations in the decoder path.
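The data flow of one encoder-decoder level with a shortcut connection can be illustrated with plain NumPy. This is only a shape-level sketch: it leaves out the convolutions and learned weights of a real U-Net, and the function names and feature-map sizes are assumptions.

```python
import numpy as np

def max_pool2x2(x):
    """Encoder step: halve the spatial size of a (C, H, W) feature map (H, W even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Decoder step: nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
encoder_features = rng.random((4, 8, 8))      # high-resolution, low-level features

bottleneck = max_pool2x2(encoder_features)    # (4, 4, 4): coarser, more context
decoder_features = upsample2x(bottleneck)     # (4, 8, 8): resolution restored

# Shortcut connection: concatenate encoder and decoder features along channels.
merged = np.concatenate([encoder_features, decoder_features], axis=0)
print(merged.shape)  # (8, 8, 8)
```

The upsampled features alone have lost fine detail, since every 2×2 block holds a single value. Concatenating the encoder features gives the following layers access to that detail again.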

Variations in U-Net architecture:

For pipelines that use the U-Net, the most common modification is to introduce some crafted modules (residual block, dense block, attention module, etc.). These new modules can replace the convolutions or be inserted into the shortcut connections.

Aslani et al. presented a U-Net-like structure with convolutions replaced by residual blocks in the encoder path. Hashemi et al. and Zhang et al. adopted the Tiramisu network, which replaces the convolution layers with densely connected blocks (with skip connections between layers). Hu et al. presented a context-guided module to expand the receptive field and utilize contextual information. Vang et al. augmented the U-Net with the Mask R-CNN framework. No-new-UNet (nnU-Net) is a multi-architecture framework that adaptively chooses an ensemble from 2D, 3D and 3D cascaded networks.

Attention Mechanism: Attention is a mechanism by which a network can weigh features by their level of importance to a task, and use this weighting to help achieve the task. In general, it can be divided into spatial attention, channel attention, and longitudinal attention. Zhang et al. presented a recurrent slice-wise attention network for 3D CNNs. By performing slice-wise attention from three orientations, the memory demands of 3D spatial attention are reduced. Hu et al. included 3D spatial attention blocks in the decoding stage. Durso-Finley et al. used a saliency-based attention module before the U-Net structure to make the network aware of the difference between pre- and post-contrast T1-w images and thus focus on the contrast-enhancing lesions. Hou et al. proposed a cross-attention block that combines spatial attention and channel attention. Zhang et al. [90] extended their folded (slice-wise) attention to a spatial-channel attention module and used four permutations (corresponding to four dimensions) to build four small sub-affinity matrices that approximate the original affinity matrix. In this case, the original affinity matrix is regularized as a rank-one matrix and the computational burden is alleviated. Gessert et al. introduced attention-guided interactions to enable effective information exchange between the processing paths of the two time points.
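To give a feel for what "weighing features by importance" means, here is a generic channel-attention sketch in the squeeze-and-excitation style. It is not the module of any of the papers cited above; the function names, layer sizes and random weights are assumptions, and in a trained network `w1` and `w2` would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w1, w2):
    """Reweight the channels of a (C, H, W) feature map."""
    squeezed = features.mean(axis=(1, 2))        # one summary value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)      # small bottleneck layer with ReLU
    weights = sigmoid(w2 @ hidden)               # importance of each channel, in (0, 1)
    return features * weights[:, None, None], weights

rng = np.random.default_rng(0)
features = rng.random((8, 16, 16))               # 8 channels of 16x16 features
w1 = rng.normal(size=(2, 8))                     # random stand-ins for learned weights
w2 = rng.normal(size=(8, 2))

attended, weights = channel_attention(features, w1, w2)
```

Spatial attention follows the same pattern, except that the weights form an (H, W) map that scales locations instead of channels.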

Loss Functions

Since the task is segmentation, Binary Cross Entropy (BCE) is the most commonly used loss function. If class imbalance exists, a weighted L2 loss can be used, in which the class with lower probability (i.e. lesion) is compensated with higher weights. Weighted BCE, focal loss, Dice (F1) loss and Tversky loss are other losses that can be used for the MRI image segmentation task.
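The losses mentioned above can be written compactly in NumPy. The sketch below uses common textbook forms; the function names, the default hyperparameters and the toy predictions are assumptions, and individual papers may use slightly different variants.

```python
import numpy as np

EPS = 1e-7  # avoids log(0) and division by zero

def weighted_bce(pred, target, pos_weight=1.0):
    """Binary cross entropy; pos_weight > 1 penalizes missed lesion voxels more."""
    pred = np.clip(pred, EPS, 1.0 - EPS)
    loss = -(pos_weight * target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(loss.mean())

def focal_loss(pred, target, gamma=2.0):
    """Down-weights easy voxels so training focuses on the hard ones."""
    pred = np.clip(pred, EPS, 1.0 - EPS)
    p_t = np.where(target == 1, pred, 1.0 - pred)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def dice_loss(pred, target):
    """1 - soft Dice overlap between prediction and ground truth."""
    intersection = np.sum(pred * target)
    return float(1.0 - (2.0 * intersection + EPS) / (np.sum(pred) + np.sum(target) + EPS))

def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Generalizes Dice; here alpha weights false positives, beta false negatives."""
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    return float(1.0 - (tp + EPS) / (tp + alpha * fp + beta * fn + EPS))

target = np.array([0.0, 0.0, 0.0, 1.0, 1.0])   # toy ground truth
pred = np.array([0.1, 0.2, 0.1, 0.6, 0.9])     # toy predicted probabilities

for name, value in [("BCE", weighted_bce(pred, target)),
                    ("weighted BCE", weighted_bce(pred, target, pos_weight=5.0)),
                    ("focal", focal_loss(pred, target)),
                    ("Dice", dice_loss(pred, target)),
                    ("Tversky", tversky_loss(pred, target))]:
    print(f"{name}: {value:.4f}")
```

With `alpha = beta = 0.5` the Tversky loss reduces to the Dice loss. Raising `beta` above `alpha` penalizes missed lesion voxels more heavily, which is one way to counter class imbalance.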

Implementation

To train the networks, the simplest strategy is to use a fixed number of epochs. To avoid over-fitting, early stopping can be implemented. The training dataset can be divided into a fixed training subset and validation subset, or k-fold cross-validation can be utilized. For optimizing the parameters, stochastic gradient descent (SGD) and its variations are used. To adapt the learning rate to the parameters, Adadelta or Adaptive Moment Estimation (Adam) optimizers can be used.
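Early stopping is simple enough to sketch without any deep learning framework. The class below is a hypothetical helper, and the validation losses are made-up numbers standing in for what a real training loop would compute after each epoch.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

Training stops after three epochs without improvement, so the later loss of 0.60 is never reached. This shows the trade-off: a larger `patience` tolerates plateaus but trains for longer. In practice the model weights from the best epoch are usually saved and restored.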

Prediction

For non-FCN methods, the segmentation is made by classifying all the candidates extracted from the test image voxel by voxel. For FCN methods, 2D networks predict the segmentation slice by slice, while 3D methods are able to predict the whole image at once.
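Slice-by-slice prediction with a 2D network can be sketched as below. `predict_slice` is a hypothetical stand-in for a trained 2D model (here just a sigmoid of the intensity), and the 0.5 threshold is an assumption.

```python
import numpy as np

def predict_slice(slice_2d):
    """Hypothetical stand-in for a trained 2D network: returns lesion probabilities."""
    return 1.0 / (1.0 + np.exp(-(slice_2d - 0.5) * 10.0))

def predict_volume(volume, threshold=0.5):
    """Run the 2D model on every slice and stack the results back into a volume."""
    probs = np.stack([predict_slice(volume[z]) for z in range(volume.shape[0])], axis=0)
    return probs, probs > threshold

# Toy volume with one bright block playing the role of a lesion.
volume = np.zeros((6, 12, 12))
volume[2:4, 4:8, 4:8] = 1.0

probabilities, segmentation = predict_volume(volume)
print(segmentation.shape, int(segmentation.sum()))
```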

Evaluation Metrics

In the task of MS lesion segmentation, the commonly reported metrics include: Dice similarity coefficient (DSC), Jaccard coefficient, absolute volume difference (AVD), average symmetric surface distance (ASSD/SD), true positive rate (TPR, sensitivity, recall), false positive rate (FPR), positive predictive value (PPV, precision), lesion-wise true positive rate (LTPR) and lesion-wise false positive rate (LFPR). The Precision-Recall (PR) curve is suitable for evaluating performance on highly unbalanced datasets. The Receiver Operating Characteristic (ROC) curve is also used.
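The voxel-wise metrics follow directly from the counts of true/false positives and negatives, as in this sketch. The function name and the toy masks are assumptions; AVD is computed here as the volume difference relative to the ground-truth volume, which is one common convention. The code assumes both masks are non-empty, and the lesion-wise and surface-distance metrics are omitted because they need connected-component analysis.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Voxel-wise overlap metrics for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "tpr": tp / (tp + fn),          # sensitivity / recall
        "ppv": tp / (tp + fp),          # precision
        "fpr": fp / (fp + tn),
        "avd": abs(int(pred.sum()) - int(truth.sum())) / truth.sum(),
    }

# Toy 2D example: the prediction is shifted by one row relative to the truth.
truth = np.zeros((10, 10), dtype=bool)
truth[2:6, 2:6] = True
pred = np.zeros((10, 10), dtype=bool)
pred[3:7, 2:6] = True

metrics = segmentation_metrics(pred, truth)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The example shows why several metrics are reported together: the predicted volume matches the true volume exactly (AVD of 0), yet a quarter of the voxels are misplaced, which only the overlap metrics reveal.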

Results

The MICCAI 2008 and ISBI 2015 competitions are still accepting submissions and providing evaluation results on the test dataset, thus serving as objective benchmarks. Comparing the scores in both competitions shows that 3D and 2.5D methods outperform 2D approaches. Also, U-Net-based architectures tend to perform better than non-FCN CNN-based and non-U-Net FCN-based architectures. Based on the scores, the Tiramisu architecture with the 2.5D method by Zhang et al. performs best.

For more information about various techniques used, kindly refer to the survey paper.