Now the idea is to implement the FOTS unified network for text detection and recognition. To do this, we need 2 models: one to detect text regions (bounding boxes) and another to recognize the text inside them, plus an ROIRotate step between the two branches that straightens the tilted boxes.

Here we are going to use two datasets (or more). Ideally, we train our model on the SynthText dataset and fine-tune it on the ICDAR dataset.

What we have: 2 models (detection and recognition), 1 intermediate ROIRotate operation and 2 datasets. Let's get started.

We have 2 datasets in different formats, so we need to bring them into a single common format for convenience.

I have converted the SynthText dataset to the ICDAR format because that format is simple.

Data format: x1, y1, x2, y2, x3, y3, x4, y4, <word in the box>

Note: the bounding-box coordinates are listed clockwise, starting from the top-left corner.
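For illustration, here is a minimal sketch of reading one ground-truth file in this format. The helper name load_icdar_annotation and the file layout (one comma-separated line per box) are my assumptions, not necessarily the code in the repo.

# Minimal sketch of parsing one ICDAR-style ground-truth file.
# Assumes one line per box: x1,y1,x2,y2,x3,y3,x4,y4,transcription
import numpy as np

def load_icdar_annotation(gt_path):
    polys, texts = [], []
    with open(gt_path, encoding='utf-8-sig') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) < 9:
                continue  # skip malformed lines
            coords = list(map(float, parts[:8]))
            text = ','.join(parts[8:])  # the transcription may itself contain commas
            polys.append(np.array(coords, dtype=np.float32).reshape(4, 2))
            texts.append(text)
    return np.array(polys), texts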

Check my EDA notebooks for a detailed explanation of the data.

Now that we have all the data in one format, let's start with the actual preprocessing.

We need to augment the images to make the model generalize better. The steps are:

  • rotate by a random angle between -10 and +10 degrees (a minimal sketch follows this list)
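A minimal rotation sketch, assuming OpenCV is available and the boxes are stored as an (N, 4, 2) array of corner points; the helper name random_rotate is illustrative, not necessarily the repo's code.

import cv2
import numpy as np

def random_rotate(image, polys, max_angle=10):
    """Rotate the image and its box corners by a random angle in [-10, 10] degrees."""
    angle = np.random.uniform(-max_angle, max_angle)
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)   # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (w, h))
    # apply the same affine transform to every corner point
    ones = np.ones((*polys.shape[:2], 1), dtype=np.float32)
    pts = np.concatenate([polys, ones], axis=-1)              # homogeneous coordinates
    rotated_polys = pts @ M.T                                  # back to (N, 4, 2)
    return rotated, rotated_polys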

Once the augmented data is produced, we format the ground truth for the models. We have 2 models, one for detection and one for recognition.

For detection, we need to find the bounding boxes of the text, so we need:

  • Score maps: an image where the value of each pixel is 0 or 1; 1 indicates that the pixel is part of a bounding box and 0 indicates the opposite.
  • Geo maps: for each positive pixel, the distances to the four edges of its bounding box, plus the box angle.

For ROIRotate, we need the coordinates of the boxes and the angle by which they are rotated.

For recognition, we need the actual text inside each bounding box.
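As an illustration, a minimal score-map sketch, assuming the boxes are (4, 2) corner arrays in image coordinates; the real implementation also shrinks the polygons and builds the geo maps, angle maps and training masks.

import cv2
import numpy as np

def make_score_map(image_shape, polys, scale=0.25):
    """Binary score map at 1/4 resolution: 1 inside text polygons, 0 elsewhere."""
    h, w = image_shape[:2]
    score_map = np.zeros((int(h * scale), int(w * scale)), dtype=np.uint8)
    for poly in polys:
        pts = np.round(poly * scale).astype(np.int32)
        cv2.fillPoly(score_map, [pts], 1)
    return score_map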

Generators are used to feed the data because image data consumes a lot of memory.

input_data = [batch_images, batch_text_polyses, batch_text_labels,
              batch_boxes_masks, transform_matrixes, box_widths, batch_rboxes]
output = [batch_image_fns, batch_score_maps, batch_geo_maps,
          batch_training_masks, batch_text_labels_sparse, box_widths]

yield (input_data, output)

The whole model building process is divided into 4 parts.

1. Shared convolutions: Here we take the image and pass it through a pre-trained ResNet-50 followed by deconvolution and upsampling layers, so that the final feature map is 1/4 the size of the original image. The final convolutions are called shared convolutions, and their outputs are the image features.

Shared convolutions architecture

As shown in the snippet above, we can use any pre-trained network, but ResNet-50 is recommended. For lower real-time latency, a smaller backbone (e.g., ResNet-34) can be used.
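Since the original snippet is not reproduced here, below is a minimal Keras sketch of such a shared-convolutions backbone, assuming the standard tf.keras.applications ResNet-50 layer names; the filter counts and merge strategy are my assumptions, not necessarily the author's exact architecture.

import tensorflow as tf
from tensorflow.keras import layers

def build_shared_convolutions(input_shape=(512, 512, 3)):
    backbone = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                               input_shape=input_shape)
    # feature maps at strides 32, 16, 8 and 4 (Keras ResNet-50 layer names)
    names = ['conv5_block3_out', 'conv4_block6_out', 'conv3_block4_out', 'conv2_block3_out']
    c5, c4, c3, c2 = [backbone.get_layer(n).output for n in names]

    x = layers.UpSampling2D()(c5)
    x = layers.Concatenate()([x, c4])
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.UpSampling2D()(x)
    x = layers.Concatenate()([x, c3])
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.UpSampling2D()(x)
    x = layers.Concatenate()([x, c2])
    shared = layers.Conv2D(32, 3, padding='same', activation='relu')(x)  # 1/4 of the input size
    return tf.keras.Model(backbone.input, shared, name='shared_convolutions')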


2. Detection: We build a custom model with 3 output heads (F_score, geo map, angle map) that are computed separately and then concatenated. The detection branch produces dense per-pixel text predictions using the features produced by the shared convolutions.

Preprocessed image (input)
  • F_score: Predicts the probability that a pixel is positive (part of a bounding box).
F-score

  • geo_map: predicts, for each positive pixel, the distances to the top, bottom, left and right edges of its bounding box.
Geo map channel: pixel distance from the bottom edge

The geo map layers are multiplied by 512 (the image size) so that the pixel distances are scaled to the size of the image.

  • angle_map: predicts the rotation angle of the associated bounding box.

Correspondingly, the angle map is multiplied by pi/2 so that the angles are mapped with respect to the vertical line.

Finally, the geo map and the angle map are concatenated to obtain an output with 5 channels.
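A minimal sketch of such a detection head on top of the shared features; the sigmoid activations and 1x1 convolutions are my assumptions, while the 512 and pi/2 scalings follow the description above.

import numpy as np
from tensorflow.keras import layers

def build_detection_head(shared_features, image_size=512):
    # per-pixel probability that the pixel lies inside a text box
    f_score = layers.Conv2D(1, 1, activation='sigmoid', name='f_score')(shared_features)
    # 4 channels: distances to the box edges, scaled up to the image size
    geo = layers.Conv2D(4, 1, activation='sigmoid')(shared_features)
    geo = layers.Lambda(lambda g: g * image_size, name='geo_map')(geo)
    # 1 channel: rotation angle, multiplied by pi/2 to map it relative to the vertical
    angle = layers.Conv2D(1, 1, activation='sigmoid')(shared_features)
    angle = layers.Lambda(lambda a: a * np.pi / 2, name='angle_map')(angle)
    geometry = layers.Concatenate(name='geometry')([geo, angle])  # 5 channels
    return f_score, geometry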

Once we have the boxes, we use locally-aware NMS (non-maximum suppression) to keep the final boxes with the highest probability.
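As a simple stand-in (not the locally-aware NMS from the paper), plain NMS on axis-aligned boxes could look like this:

import tensorflow as tf

def simple_nms(boxes, scores, max_boxes=100, iou_threshold=0.5):
    """Plain NMS on axis-aligned [y1, x1, y2, x2] boxes; a stand-in for the
    locally-aware NMS described in the paper."""
    keep = tf.image.non_max_suppression(boxes, scores, max_output_size=max_boxes,
                                        iou_threshold=iou_threshold)
    return tf.gather(boxes, keep), tf.gather(scores, keep)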

3. ROIRotate: Some of the boxes obtained from the detection branch may be tilted and some may not. To bring all the boxes to the same orientation, i.e. a 0 degree tilt, we use the ROIRotate operation.

Picture of RoIRotate

The paper suggests implementing this with an affine transformation, but I have used tf.image.crop_to_bounding_box to crop the boxes and tfa.image.rotate to align them.
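A minimal sketch of that crop-and-rotate step for a single box, assuming the box is given as (top, left, height, width) in feature-map pixels and the angle in radians; the function name roi_rotate and the sign convention are illustrative.

import tensorflow as tf
import tensorflow_addons as tfa

def roi_rotate(feature_map, box, angle, target_height=8):
    """Crop one box from a (1, H, W, C) feature map and rotate it upright."""
    top, left, height, width = box
    crop = tf.image.crop_to_bounding_box(feature_map, top, left, height, width)
    upright = tfa.image.rotate(crop, angles=-angle, interpolation='bilinear')
    # resize to a fixed height, keeping the aspect ratio roughly intact
    new_width = max(1, int(round(width * target_height / height)))
    return tf.image.resize(upright, (target_height, new_width))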

During training, we use the ground-truth boxes, which are later used to train the recognition model. But in the inference phase, we get the boxes from the detection branch predictions (after NMS) and crop and rotate those boxes.

4. Recognition: Once we have the boxes, we build another custom model that should be able to recognize the text inside each box. For this, we use the neural network described below.

Detailed structure of the OCR branch

The last fully connected layer feeds a CTC (Connectionist Temporal Classification) decoder. The whole network is known as a CRNN (Convolutional Recurrent Neural Network).

the high-level architecture of a typical CRNN
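For illustration, a minimal CRNN-style recognition head in Keras; the layer sizes, the ROI height of 8 and the channel count are my assumptions, not the exact architecture used in the repo.

import tensorflow as tf
from tensorflow.keras import layers

def build_recognition_branch(num_classes, height=8, channels=32):
    """CRNN-style head: conv layers, squeeze the height axis, bidirectional LSTMs,
    then a dense layer whose logits feed the CTC decoder."""
    inputs = layers.Input(shape=(height, None, channels))   # cropped, rotated ROI features
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # shrink the height only
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # height is now 1
    x = layers.Lambda(lambda t: tf.squeeze(t, axis=1))(x)    # (batch, width, features)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    logits = layers.Dense(num_classes + 1)(x)                # +1 for the CTC blank
    return tf.keras.Model(inputs, logits, name='recognition_branch')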

In practice, the shared convolutions and the detection branch are trained as one part and the recognition branch as another.

Detection loss:

The detection loss uses balanced cross-entropy (as in the paper), also known as the classification loss.

Detection loss (classification component):

The second component of the detection loss takes into account the IoU between the ground-truth bounding boxes and the predicted bounding boxes, as well as their rotation angles.

Detection loss (regression component):

The total detection loss is the sum of the above losses (a loss sketch follows the bullet below):

  • Lambda_reg is a hyperparameter and set to 20 in this implementation.
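A hedged sketch of such a detection loss (balanced cross-entropy plus an EAST-style IoU and angle term), assuming the 5-channel geometry output above; the relative weighting of the IoU and angle terms and the omission of the training mask are simplifications of mine.

import tensorflow as tf

def detection_loss(y_true_score, y_pred_score, y_true_geo, y_pred_geo, lambda_reg=20.0):
    """Balanced cross-entropy on the score map plus IoU/angle loss on the geometry."""
    # classification: balance positive and negative pixels
    beta = 1.0 - tf.reduce_mean(y_true_score)
    eps = 1e-6
    cls = -(beta * y_true_score * tf.math.log(y_pred_score + eps) +
            (1.0 - beta) * (1.0 - y_true_score) * tf.math.log(1.0 - y_pred_score + eps))
    cls_loss = tf.reduce_mean(cls)

    # regression: split the 5 geometry channels into 4 distances + 1 angle
    d1_t, d2_t, d3_t, d4_t, theta_t = tf.split(y_true_geo, 5, axis=-1)
    d1_p, d2_p, d3_p, d4_p, theta_p = tf.split(y_pred_geo, 5, axis=-1)
    area_t = (d1_t + d2_t) * (d3_t + d4_t)
    area_p = (d1_p + d2_p) * (d3_p + d4_p)
    w_inter = tf.minimum(d3_t, d3_p) + tf.minimum(d4_t, d4_p)
    h_inter = tf.minimum(d1_t, d1_p) + tf.minimum(d2_t, d2_p)
    inter = w_inter * h_inter
    union = area_t + area_p - inter
    iou_loss = -tf.math.log((inter + 1.0) / (union + 1.0))
    angle_loss = 1.0 - tf.cos(theta_p - theta_t)
    # only count pixels inside ground-truth text regions (training mask omitted here)
    reg_loss = tf.reduce_mean((iou_loss + angle_loss) * y_true_score)

    return cls_loss + lambda_reg * reg_loss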

Recognition loss:

The CTC loss is used as the recognition loss (a minimal sketch follows the bullet below). The total FOTS loss is the detection loss plus the recognition loss.

  • Lambda_recog is also a hyperparameter and is set to 1 in this implementation.
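A minimal CTC-loss sketch using tf.keras.backend.ctc_batch_cost; the tensor shapes in the comments are assumptions about how the recognition branch is wired.

import tensorflow as tf

def recognition_loss(labels, logits, logit_lengths, label_lengths):
    """CTC loss over the recognition branch outputs.
    labels: (batch, max_label_len) padded integer labels
    logits: (batch, time, num_classes + 1) raw outputs of the last dense layer
    logit_lengths, label_lengths: (batch, 1) sequence lengths"""
    y_pred = tf.nn.softmax(logits, axis=-1)
    return tf.reduce_mean(
        tf.keras.backend.ctc_batch_cost(labels, y_pred, logit_lengths, label_lengths))

# total FOTS loss = detection loss + lambda_recog * recognition loss (lambda_recog = 1 here)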

During training, we train both models separately and monitor their results. Based on the performance, we can change the loss weighting to improve the corresponding model.

Due to hardware limitations, I could not train the model fully. I trained it on 4,000 images from the SynthText dataset and used Tesseract instead of the recognition model, because training OCR models requires heavy compute.

Inference: result (left) and score map (right)

For good results:

Train your models on SynthText and then fine-tune them on the ICDAR dataset.

Train the detection model to a loss of 0.02 or less and the recognition model to a loss of 0.5 or less.

Use tf.GradientTape to control the training of the models; a minimal training-step sketch is shown below.
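For example, a minimal custom training step with tf.GradientTape, reusing the detection_loss sketch above; the optimizer choice and the model's output signature are illustrative assumptions.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-4)  # optimizer and learning rate are my assumption

@tf.function
def train_step(detector, batch_images, gt_score_maps, gt_geo_maps):
    """One custom training step for the detection branch using tf.GradientTape."""
    with tf.GradientTape() as tape:
        pred_score, pred_geo = detector(batch_images, training=True)
        loss = detection_loss(gt_score_maps, pred_score, gt_geo_maps, pred_geo)
    grads = tape.gradient(loss, detector.trainable_variables)
    optimizer.apply_gradients(zip(grads, detector.trainable_variables))
    return loss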

Visit my GitHub repo to get the code and feel free to contact me on LinkedIn for further discussion.

Github:https://github.com/akhildasari/FOTS_keras_TF2

LinkedIn:https://www.linkedin.com/in/akhil-dasari/
