The idea is to implement FOTS, a unified network for text spotting. For this we need two branches: one to detect the text regions (bounding boxes) and one to recognize the text inside them, plus a layer between the two branches that straightens the rotated boxes.
What we have: 2 models, 1 intermediate layer (ROIRotate), and 2 datasets. Let's get started.
We have 2 datasets in different formats, so we need to bring them into one common format for convenience.
I converted the SynthText dataset to the ICDAR format because that format is simple.
Data format: x1, y1, x2, y2, x3, y3, x4, y4, <word in the box>
Note: the bounding-box coordinates start at the top-left corner and go clockwise.
Check my EDA notebooks for a detailed explanation of the data.
Now that we have all the data in one format, let's start with the actual preprocessing.
We need to augment the images to make the model generalize better. The steps are:
- Rotate by -10 to +10 degrees
- Scale the height by a factor of 0.8–1.2
- Crop a random 600×600 patch from the image
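The augmentation steps above can be sketched in NumPy. This is a minimal illustration, not the actual pipeline: the parameter sampling matches the ranges listed, and the crop shifts the box coordinates into patch coordinates; the rotation and scaling themselves would be applied with an image library.

```python
import numpy as np

def sample_augmentation_params(rng):
    """Sample a rotation angle and a height-scaling factor per image."""
    angle = rng.uniform(-10.0, 10.0)   # degrees
    scale = rng.uniform(0.8, 1.2)      # vertical scaling factor
    return angle, scale

def random_crop(image, boxes, size=600, rng=None):
    """Crop a random size x size patch and shift box coordinates into it."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    # boxes: (N, 8) array of x1,y1,...,x4,y4 -> shift into patch coordinates
    shifted = boxes.astype(np.float32).copy()
    shifted[:, 0::2] -= left   # x coordinates
    shifted[:, 1::2] -= top    # y coordinates
    return patch, shifted
```

A real version would also drop boxes that fall outside the crop and clip the ones that straddle its border.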
Once the augmented data is produced, we build the ground truth for the models. We have 2 models, one for detection and one for recognition.
For detection, we need to find the text bounding boxes, so we need:
- Score map: an image where each pixel is 0 or 1; 1 indicates the pixel is part of a bounding box, 0 the opposite.
- Geo map: an image with 5 channels (top, bottom, left, right, angle). For each positive pixel, the first four channels hold the distance from that pixel to the corresponding edge of the box it belongs to, and the fifth holds the box's rotation angle.
- Training mask: marks which bounding boxes should be considered during training. Boxes with an area smaller than a threshold (which we can set) are ignored.
For ROIRotate, we need the coordinates of the boxes and the angle at which they are oriented.
For recognition, we need the actual text inside each bounding box.
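As a rough sketch of how the detection ground truth can be built, here is a NumPy version simplified to axis-aligned boxes (the real code handles rotated quads and fills the angle channel); the 1/4 output scale matches the shared-convolution stride, and `min_area` is an illustrative name for the configurable threshold:

```python
import numpy as np

def make_targets(img_h, img_w, boxes, min_area=100.0):
    """Build score map, 5-channel geo map, and training mask at 1/4 scale.
    boxes: (N, 4) axis-aligned [x1, y1, x2, y2] in image coordinates; a full
    implementation would handle rotated quads and set the angle channel."""
    h, w = img_h // 4, img_w // 4
    score = np.zeros((h, w), np.float32)
    geo = np.zeros((h, w, 5), np.float32)       # top, bottom, left, right, angle
    mask = np.ones((h, w), np.float32)          # training mask
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    for x1, y1, x2, y2 in boxes / 4.0:          # scale boxes to map resolution
        inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
        if (x2 - x1) * (y2 - y1) * 16 < min_area:   # too small: ignore in training
            mask[inside] = 0.0
            continue
        score[inside] = 1.0
        geo[..., 0][inside] = (ys - y1)[inside]     # distance to top edge
        geo[..., 1][inside] = (y2 - ys)[inside]     # distance to bottom edge
        geo[..., 2][inside] = (xs - x1)[inside]     # distance to left edge
        geo[..., 3][inside] = (x2 - xs)[inside]     # distance to right edge
        # geo[..., 4] stays 0 for axis-aligned boxes (rotation angle)
    return score, geo, mask
```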
Generators are used to feed the data because image data consumes a lot of memory.
input_data = [batch_images, batch_text_polyses, batch_text_labels, batch_boxes_masks, transform_matrixes, box_widths, batch_rboxes]
output = [batch_image_fns, batch_score_maps, batch_geo_maps, batch_training_masks, batch_text_labels_sparse, box_widths]
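The generator idea boils down to loading and stacking one batch at a time instead of holding the whole dataset in memory. A minimal sketch (the `load_example` stub is hypothetical; the real loader reads the image and builds the score/geo maps and masks listed above):

```python
import numpy as np

def batch_generator(image_fns, batch_size=8, img_size=600):
    """Yield batches lazily so only one batch of images is in memory at a time."""
    def load_example(fn):
        # placeholder: a real version would read the file and build the
        # score map, geo map, and training mask for it
        image = np.zeros((img_size, img_size, 3), np.float32)
        score = np.zeros((img_size // 4, img_size // 4, 1), np.float32)
        return image, score

    for start in range(0, len(image_fns), batch_size):
        chunk = image_fns[start:start + batch_size]
        images, scores = zip(*(load_example(fn) for fn in chunk))
        yield np.stack(images), np.stack(scores)
```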
The whole model building process is divided into 4 parts.
1. Shared convolutions: here we take the image and pass it through a pre-trained ResNet-50 followed by deconvolution/upsampling layers, so that the final feature map is 1/4 the size of the original image. These final convolutions are called shared convolutions, and their output is the feature representation of the image.
We can use any pre-trained network as the backbone, but ResNet-50 is recommended; for lower real-time latency, a lighter backbone such as ResNet-34 can be used.
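The shared-convolution idea can be sketched in Keras as a ResNet-50 whose intermediate feature maps are merged top-down until the output sits at 1/4 of the input resolution. This is an illustrative sketch, not the repo's exact code; layer names are the standard Keras ResNet-50 stage outputs, and `weights=None` just keeps the sketch light (use 'imagenet' in practice):

```python
import tensorflow as tf

def build_shared_convolutions(input_shape=(640, 640, 3)):
    """Backbone sketch: ResNet-50 stage features merged top-down so the
    output feature map is 1/4 of the input resolution."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_shape=input_shape)
    # stage outputs at strides 4, 8, 16, 32
    names = ["conv2_block3_out", "conv3_block4_out",
             "conv4_block6_out", "conv5_block3_out"]
    c2, c3, c4, c5 = [backbone.get_layer(n).output for n in names]
    x = c5
    for skip, filters in [(c4, 128), (c3, 64), (c2, 32)]:
        x = tf.keras.layers.UpSampling2D(2)(x)   # upsampling ("deconvolution")
        x = tf.keras.layers.Concatenate()([x, skip])
        x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   activation="relu")(x)
    return tf.keras.Model(backbone.input, x, name="shared_convolutions")
```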
2. Detection: here we build a custom head with 3 branches that are built individually and then concatenated. The detection branch produces dense per-pixel text predictions from the features produced by the shared convolutions.
- F_score: Predicts the probability that a pixel is positive (part of a bounding box).
- geo_map: for each pixel, predicts the distances to the top, bottom, left, and right edges of its box, and thus has 4 filters. These outputs are multiplied by 512 (the image size) so that the predicted distances are scaled back to image coordinates.
- angle_map: predicts the orientation of the associated bounding box. The angle map is multiplied by pi/2 so that the angles are mapped with respect to the vertical.
Finally, the geo map and the angle map are concatenated to obtain the 5-channel output.
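To see how these per-pixel maps turn into candidate boxes, here is a simplified NumPy decoder that ignores the angle channel (the full version rotates each box by its predicted angle); every pixel above the score threshold votes for a box around itself using its four predicted distances:

```python
import numpy as np

def decode_boxes(score_map, geo_map, score_thresh=0.8):
    """Turn per-pixel predictions into candidate boxes (angle ignored here).
    score_map: (H, W); geo_map: (H, W, 5) with distances in map pixels."""
    ys, xs = np.where(score_map > score_thresh)
    top, bottom = geo_map[ys, xs, 0], geo_map[ys, xs, 1]
    left, right = geo_map[ys, xs, 2], geo_map[ys, xs, 3]
    # each positive pixel votes for a full box around itself
    boxes = np.stack([xs - left, ys - top, xs + right, ys + bottom], axis=1)
    scores = score_map[ys, xs]
    return boxes, scores
```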
Once we have the boxes, we use locally-aware NMS (non-maximum suppression) to keep the final boxes with the highest probability.
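Locally-aware NMS first merges consecutive overlapping boxes by score-weighted averaging (cheap, because neighbouring pixels produce near-identical boxes), and only then runs standard NMS on the much smaller merged set. A sketch for axis-aligned boxes, with illustrative thresholds:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def locally_aware_nms(boxes, scores, iou_thresh=0.5):
    """Merge consecutive overlapping boxes by score-weighted averaging,
    then run standard NMS on the merged set."""
    merged, merged_scores = [], []
    for box, s in zip(boxes, scores):
        if merged and iou(merged[-1], box) > iou_thresh:
            w_old, w_new = merged_scores[-1], float(s)
            merged[-1] = (merged[-1] * w_old + box * w_new) / (w_old + w_new)
            merged_scores[-1] = w_old + w_new
        else:
            merged.append(box.astype(np.float64))
            merged_scores.append(float(s))
    # standard NMS over the merged boxes
    order = np.argsort(merged_scores)[::-1]
    keep = []
    for i in order:
        if all(iou(merged[i], merged[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return [merged[i] for i in keep]
```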
3. ROIRotate: the boxes coming from the detection branch may be tilted, and some may not be. To bring all the boxes to the same orientation, i.e. 0 degrees of tilt, we use ROIRotate.
The paper suggests implementing this with an affine transformation, but I used tf.image.crop_to_bounding_box to crop the boxes and tfa.image.rotate to align them.
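For intuition, the affine-transformation route the paper suggests amounts to sampling the feature map along the box's rotated axes. A nearest-neighbour NumPy sketch (the paper uses bilinear sampling, and all names here are illustrative):

```python
import numpy as np

def roi_rotate(feature_map, center, angle, out_h=8, out_w=64):
    """Extract an axis-aligned out_h x out_w patch around `center` from a
    rotated box by sampling the feature map along the box's rotated axes
    (nearest-neighbour affine sampling; the paper uses bilinear)."""
    cy, cx = center
    cos, sin = np.cos(angle), np.sin(angle)
    ys = np.arange(out_h) - out_h / 2.0
    xs = np.arange(out_w) - out_w / 2.0
    gx, gy = np.meshgrid(xs, ys)
    # rotate the sampling grid by `angle` around the box centre
    src_x = cx + gx * cos - gy * sin
    src_y = cy + gx * sin + gy * cos
    src_x = np.clip(np.round(src_x).astype(int), 0, feature_map.shape[1] - 1)
    src_y = np.clip(np.round(src_y).astype(int), 0, feature_map.shape[0] - 1)
    return feature_map[src_y, src_x]
```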
During training, we use the ground-truth boxes to crop regions that are then used to train the recognition model. During inference, we instead take the boxes predicted by the detection branch (after NMS) and crop those.
4. Recognition: once the cropped boxes are obtained, we build another custom model that recognizes the text inside each box. For this we use the neural network described below.
The last fully connected layer feeds a CTC (Connectionist Temporal Classification) decoder. The whole network is known as a CRNN (Convolutional Recurrent Neural Network).
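A minimal Keras sketch of such a CRNN (layer sizes are illustrative, not the repo's exact configuration): a small conv stack shrinks the height axis, a bidirectional LSTM models the left-to-right character sequence, and a dense softmax layer (with one extra class for the CTC blank) feeds the CTC decoder.

```python
import tensorflow as tf

def build_crnn(num_classes, height=8, width=64, channels=32):
    """CRNN sketch: convs collapse the height axis, a BiLSTM models the
    character sequence, and a softmax layer feeds the CTC decoder."""
    inp = tf.keras.Input((height, width, channels))
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = tf.keras.layers.MaxPool2D((2, 1))(x)   # shrink height, keep width
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPool2D((2, 1))(x)
    # collapse height -> (batch, time=width, features)
    x = tf.keras.layers.Permute((2, 1, 3))(x)
    x = tf.keras.layers.Reshape((width, (height // 4) * 64))(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)
    out = tf.keras.layers.Dense(num_classes + 1,           # +1 for CTC blank
                                activation="softmax")(x)
    return tf.keras.Model(inp, out)
```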
In practice, the shared convolutions and the detection branch are trained as one part, and recognition as another.
The detection loss includes a balanced cross-entropy term (as in the paper), also known as the classification loss.
Detection loss (classification component):
The second component of the detection loss considers the IoU between the ground-truth bounding boxes and the predicted bounding boxes, together with the difference in their rotation angles.
Detection loss (regression component):
The total detection loss is the sum of the above losses:
- Lambda_reg is a hyperparameter, set to 20 in this implementation.
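Putting the pieces together, the detection loss can be sketched in NumPy as balanced cross-entropy on the score map plus an IoU-style term on the four distance channels and a cosine angle term, restricted by the training mask. The inner angle weight of 10 follows the EAST/FOTS papers; the exact weighting in the repo may differ.

```python
import numpy as np

def detection_loss(y_score, p_score, y_geo, p_geo, mask, lambda_reg=20.0):
    """Sketch of the detection loss: balanced cross-entropy (classification)
    plus IoU + angle regression on positive pixels, combined via lambda_reg."""
    eps = 1e-7
    # balanced cross-entropy: weight positives by the negative fraction
    beta = 1.0 - np.mean(y_score)
    cls = -(beta * y_score * np.log(p_score + eps)
            + (1 - beta) * (1 - y_score) * np.log(1 - p_score + eps))
    cls = np.sum(cls * mask) / (np.sum(mask) + eps)

    # IoU loss on (top, bottom, left, right) distances, positive pixels only
    t, b, l, r = (y_geo[..., i] for i in range(4))
    tp, bp, lp, rp = (p_geo[..., i] for i in range(4))
    inter = (np.minimum(t, tp) + np.minimum(b, bp)) * \
            (np.minimum(l, lp) + np.minimum(r, rp))
    union = (t + b) * (l + r) + (tp + bp) * (lp + rp) - inter
    iou_loss = -np.log((inter + eps) / (union + eps))
    angle_loss = 1.0 - np.cos(p_geo[..., 4] - y_geo[..., 4])
    pos = y_score * mask
    reg = np.sum((iou_loss + 10.0 * angle_loss) * pos) / (np.sum(pos) + eps)
    return cls + lambda_reg * reg
```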
The CTC loss is used as the recognition loss. The total FOTS loss is the detection loss plus the weighted recognition loss.
- Lambda_recog is also a hyperparameter, set to 1 in this implementation.
During training, we train both models separately and monitor their results. Based on performance, we can change the loss weights to improve the corresponding model.
Due to hardware limitations, I could not train the model fully. I trained it on 4,000 images from the synthetic dataset and used Tesseract instead of the recognition model, because training OCR models from scratch demands substantial compute.
For good results:
Train your models on the synthetic data first, then fine-tune on the ICDAR dataset.
Train until the detection model reaches a loss of 0.02 or less and the recognition model a loss of 0.5 or less.
Use tf.GradientTape to control the training of the models.
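A minimal custom training step with tf.GradientTape looks like this; the toy model and MSE loss here are stand-ins for the FOTS models and losses described above:

```python
import tensorflow as tf

# toy model and optimizer; the real FOTS branches would replace these
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-2)

def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = tf.reduce_mean(tf.square(pred - y))
    # GradientTape lets us compute and apply gradients manually,
    # so the detection and recognition parts can be updated separately
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((32, 4))
y = tf.reduce_sum(x, axis=1, keepdims=True)
losses = [float(train_step(x, y)) for _ in range(50)]
```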
Visit my GitHub repo to get the code, and feel free to contact me on LinkedIn for further discussion.