Multiple bounding box detection, Part 3 - fine-tuning the backbone network: supplement
In my previous post I talked about fine-tuning a backbone network. What I didn’t realize was that the dataset didn’t contain as many positive class instances as it could. In the second post of this series I saved the proposals generated by selective search to disk. What I should have done in that step was to also save the image fragments that represent the cracks, as defined in the COCO annotation files. As it turns out, their presence changes the outcome for the two best models trained in the previous post.
The code
The code that I had to add is just a single function. It loads the bounding boxes defined in the COCO files, crops the images accordingly and, lastly, resizes the crops so that they conform to the backbone network’s input requirements. Note the 1_0.1 fragment right before the file extension is re-appended: the first part is the IoU, which for these ground-truth crops obviously equals 1, and the second is the class label, 1 as well - all to match the file-naming format used by the data loaders. A short usage sketch follows the function.
import json
import os

import cv2
import torch
from torchvision.transforms import v2

# Resize every crop to the input size expected by the backbone network.
transforms = v2.Compose([
    v2.ToPILImage(),
    v2.Resize((224, 224))
])

def extract_and_save_bboxes(images_dir: str, annotations_file: str, output_dir: str) -> None:
    with open(annotations_file, 'r') as f:
        coco_data = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    # Map image ids to file names so each annotation can be matched with its image.
    image_id_to_file = {image['id']: image['file_name'] for image in coco_data['images']}
    processed_count = 0
    image_crop_count = {}
    for annotation in coco_data['annotations']:
        bbox = annotation['bbox']
        image_id = annotation['image_id']
        if not bbox or len(bbox) != 4:
            continue
        image_file = image_id_to_file.get(image_id)
        if not image_file:
            continue
        image_path = os.path.join(images_dir, image_file)
        image = cv2.imread(image_path)
        if image is None:
            print(f"Could not load image: {image_path}")
            continue
        # COCO bboxes are stored as [x, y, width, height]; crop the fragment containing the crack.
        x, y, w, h = map(int, bbox)
        cropped_image = image[y:y+h, x:x+w]
        # cv2 returns an HxWxC array; convert to a CxHxW tensor for the torchvision transforms.
        cropped_image = torch.tensor(cropped_image).permute(2, 0, 1)
        resized_proposal = transforms(cropped_image)
        # Keep a per-image counter so that multiple crops of the same image get unique names.
        image_name, ext = os.path.splitext(image_file)
        if image_name not in image_crop_count:
            image_crop_count[image_name] = 0
        counter = image_crop_count[image_name]
        image_crop_count[image_name] += 1
        # '1_0' encodes the IoU (1.0) and '1' the class label, matching the data loaders' naming format.
        output_file = os.path.join(output_dir, f"{image_name}.{counter}.1_0.1{ext}")
        resized_proposal.save(output_file)
        processed_count += 1
        if processed_count % 100 == 0:
            print(f"Processed {processed_count} cropped images.")
Now, as for the training code - there’s a little bit of background here. In the previous posts I mentioned that my goal wasn’t to find the best performing network; what I mean by that is that I didn’t want to optimize indefinitely. I only wanted to try out a couple of approaches, obtain a somewhat satisfying result and move on. I do all the training on my home PC with an RTX 3060, and even at the time of writing this, it’s not a speed demon. So far, training the networks has taken more than a month, so at some point I just had to say “stop”. For this supplementary article I therefore didn’t go through all the approaches I had tried and focused only on the two best performing ones. That would be a mistake “in real life” though. For similar reasons I haven’t tried other optimization approaches so far, like cross-validation, hyperparameter tuning, etc. - I just didn’t have time for that, but I’m aware of their importance.
Now, to the solution: I was surprised by how much the presence of those additional samples changed the landscape. First I re-ran the notebook that combines the WeightedRandomSampler with the BCELoss function (a short recap sketch of that setup follows the table), and these were the results:
precision recall f1-score support
No crack 0.83 0.79 0.81 39615
Crack 0.73 0.78 0.76 28723
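For readers who skipped the previous post, here is a minimal sketch of how a WeightedRandomSampler can be paired with BCELoss to counter class imbalance. The dataset and the class counts below are stand-ins, not the actual training code from that notebook:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Stand-in data: in the real notebook the samples come from the cropped-image dataset on disk.
features = torch.randn(100, 3, 224, 224)
labels = torch.cat([torch.zeros(80, dtype=torch.long), torch.ones(20, dtype=torch.long)])
dataset = TensorDataset(features, labels)

# Each sample is drawn with probability inversely proportional to its class frequency,
# so the minority "Crack" class shows up in batches roughly as often as the majority class.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)

loader = DataLoader(dataset, batch_size=64, sampler=sampler)
criterion = torch.nn.BCELoss()  # expects probabilities, i.e. a sigmoid on top of the model's output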
The recall for the “Crack” class didn’t change, but the precision did, which means that, at the same recall, the model now produces fewer false positives. I wasn’t really happy with the results, because the accuracy/loss plots showed signs of overfitting at a very early stage, so after being done with that notebook, I ran the one where I used the sigmoid_focal_loss function. Interestingly, adding more positive examples made the results worse, both compared to the above and to the results achieved when I previously trained a network with that loss function:
precision recall f1-score support
No crack 0.79 0.83 0.81 39615
Crack 0.75 0.70 0.72 28723
That’s a huge performance degradation. Why did it happen? Well, as you may remember from my previous description of the sigmoid_focal_loss function, it contains some moving parts, namely the modulating factor \((1 - p_t)^{\gamma}\) (and two other parameters, but I’ll talk about them in a moment). It seems that with the base version of this loss function, that is, the one with \(\alpha\) not being used, the modulating factor’s behavior made the model pay less attention to the positive class examples - with more of them available, the class as a whole became easier to find, so its examples got down-weighted. I could have created a similar but customized loss function to counteract that, but why do that if there’s already a knob that rebalances the two classes - the \(\alpha\) parameter. As I wrote a couple of paragraphs above, I did not perform any hyperparameter tuning, nor cross-validation, to find the best \(\alpha\); I basically picked a number greater than \(0.5\). Why greater than that? Because in the previous post the default value of \(0.25\) degraded the model’s performance greatly. This time I chose to set it to \(0.6\) (the loss call is sketched a bit further below) and here are the results:
precision recall f1-score support
No crack 0.85 0.71 0.77 39615
Crack 0.67 0.82 0.74 28723
As expected, the recall values for the two classes almost reversed, but that’s a good thing: remember the goal I had set, which was to optimize for the “Crack” class recall.
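For reference, the only change compared to the base version of the loss is passing the \(\alpha\) argument. Here is a minimal sketch of the loss call with stand-in tensors (not the actual training loop):

import torch
from torchvision.ops import sigmoid_focal_loss

# Placeholder tensors standing in for a batch of model outputs and labels.
logits = torch.randn(8, requires_grad=True)   # raw, pre-sigmoid scores from the classifier head
targets = torch.randint(0, 2, (8,)).float()   # 0. = no crack, 1. = crack

# alpha=0.6 shifts more weight onto the positive ("Crack") class; gamma=2 is the default exponent
# of the modulating factor.
loss = sigmoid_focal_loss(logits, targets, alpha=0.6, gamma=2.0, reduction="mean")
loss.backward()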
Conclusion
With all that said, I’m now set up for the next phase, which is training a non-neural-network classifier. In the original R-CNN paper the authors chose an SVM, but I think I’ll diverge from that slightly and try gradient-boosting approaches with XGBoost and LightGBM, just for fun and learning.