The network used in this work is a VGG-based architecture equipped with a soft proposal layer. It is trained on a dataset of images annotated with image-level labels, and it produces a class-aware soft-proposal map that is used to localise objects in the image.
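The description above does not say how the soft proposal layer computes its map. The following is a minimal NumPy sketch under stated assumptions. It assumes the layer follows the graph-based random-walk formulation used by Soft Proposal Networks: feature-map locations become graph nodes, edge weights combine feature dissimilarity with a Gaussian on grid distance, and the stationary distribution of the resulting Markov chain serves as the proposal map. The class-aware step is also an assumption: a CAM-style linear head applied to proposal-weighted features. The function names, `eps_ratio`, `n_iters` and `class_weights` are hypothetical, and the random features stand in for real VGG convolutional features.

```python
import numpy as np


def soft_proposal_map(features: np.ndarray, eps_ratio: float = 0.15, n_iters: int = 20) -> np.ndarray:
    """Return an (H, W) objectness map summing to 1 for a (C, H, W) feature map (assumed formulation)."""
    c, h, w = features.shape
    u = features.reshape(c, h * w).T  # (N, C): one feature vector per spatial location
    # Pairwise feature distance ||u_i - u_j||, computed without an (N, N, C) temporary.
    sq = (u ** 2).sum(axis=1)
    feat_dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * u @ u.T, 0.0))
    # Gaussian on grid distance, so spatially close locations are linked more strongly.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    grid_sq = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    eps = eps_ratio * max(h, w)  # hypothetical default bandwidth
    weights = feat_dist * np.exp(-grid_sq / (2.0 * eps ** 2)) + 1e-12  # tiny floor avoids all-zero columns
    # Column-normalise so each column sums to 1 (a Markov transition matrix).
    trans = weights / weights.sum(axis=0, keepdims=True)
    # Power iteration from a uniform start approximates the random walk's stationary distribution.
    m = np.full(h * w, 1.0 / (h * w))
    for _ in range(n_iters):
        m = trans @ m
        m /= m.sum()  # guard against floating-point drift
    return m.reshape(h, w)


def class_aware_maps(features: np.ndarray, proposal: np.ndarray, class_weights: np.ndarray):
    """CAM-style per-class maps from proposal-weighted features (hypothetical head)."""
    coupled = features * proposal[None, :, :]  # re-weight every channel by the proposal map
    maps = np.einsum("kc,chw->khw", class_weights, coupled)  # (K, H, W) per-class response maps
    scores = maps.mean(axis=(1, 2))  # global average pooling gives image-level class scores
    return maps, scores


rng = np.random.default_rng(0)
feats = rng.random((8, 7, 7))  # stand-in for a VGG conv feature map
proposal = soft_proposal_map(feats)
maps, scores = class_aware_maps(feats, proposal, rng.standard_normal((3, 8)))
print(proposal.shape, maps.shape, scores.shape)  # (7, 7) (3, 7, 7) (3,)
```

Because `scores` is derived from the same maps used for localisation, a loss on image-level labels can train the whole pipeline. The per-class maps then serve as localisation cues without any box annotations.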
The resulting network achieves roughly 90% detection accuracy on a test set. Detection and localisation run in real time at inference, taking about 40 ms per image.
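At 40 ms per image, the network processes about 1000 / 40 = 25 images per second. A per-image figure like this can be checked with a simple timing harness. The sketch below is a hypothetical helper, not the authors' benchmarking code. `infer` stands for whatever forward-pass function is being measured, and the warm-up count is an assumed default.

```python
import time
from statistics import mean
from typing import Any, Callable, Sequence


def measure_latency_ms(infer: Callable[[Any], Any], inputs: Sequence[Any], warmup: int = 3) -> float:
    """Mean wall-clock time per call of `infer`, in milliseconds (inputs must be non-empty)."""
    for x in inputs[:warmup]:  # untimed warm-up runs (caches, lazy initialisation)
        infer(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def throughput_fps(latency_ms: float) -> float:
    """Images per second implied by a per-image latency."""
    return 1000.0 / latency_ms


print(throughput_fps(40.0))  # 25.0
```

On a GPU, kernels run asynchronously. The device should therefore be synchronised before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the measured time covers only the kernel launch.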