Table Document Image To Structured Text Using Deep Learning Architecture (TabStructNet) And Deployment As An Android Application
Introduction
This blog post is a part of an assignment assigned by the AppliedAICourse team.
The Android application has not been made public yet; minor changes need to be applied before it goes online.
With the increase in the availability of data in different forms, e.g. images, it has become crucial to understand and extract information from images, especially images that contain text in some kind of tabular form. We find them in invoices, tax and bank statements, medical records, and equipment and facility-related logs.
Tables are information-rich structured objects in document images.
Tables are effective at summarising and communicating complex information through only the precise and necessary data.
Understanding the structure of documents containing tables and extracting their text has many applications, including but not limited to visualization, analysis, information retrieval, content editing, and human-document interaction.
Table structure recognition refers to the representation of a table in a machine-readable format, where its layout is encoded according to a pre-defined standard. A table can be represented in either a logical or a physical format. While the logical structure contains every cell's row and column spanning information, the physical structure additionally contains bounding box coordinates.
Tables come in many variations: different foreground and background colors, font shapes and sizes, simple or complex structures, multi-column or multi-line cells, differences in width, padding and design, text alignment, table density, and so on. This variability makes it challenging to recognize the table structure and generate a structured text format.
Getting machines to understand the structure of scanned documents and handwritten tables, and to extract structured text from them (especially in CSV, XML, XLSX, or some other data structure format), is the core of, and a battlefield for, the multi-million-dollar document analysis industry.
This is not only because table information extraction is challenging, but also because the task is highly demanding in terms of accuracy and precision, given the criticality of the data that tables typically represent.
TabStruct-Net
TabStruct-Net is an end-to-end trainable deep learning architecture for cell detection and table structure recognition. TabStructNet takes table images as input (*and not the document image which contains the table) and tries to predict the table structure. It uses a two-stage process: (a) a top-down stage (decomposition), a cell detection network (for the fundamental table objects) based on Mask R-CNN with a modified FPN; and (b) a bottom-up stage (composition), which takes the information from the cell detection network, along with the cells' row-column associations encoded as adjacency matrices, and rebuilds the entire table (the bottom-up stage uses an LSTM and a DGCNN to model cell interactions and predict cell associations).
Detecting table cells is a challenging problem due to:
- Different sizes of cells in the same table
- Varying cell alignment due to the amount of text
- Multi-line cells
- Lack of linguistic context in cells’ content
- Empty cells
To overcome these challenges, we introduce a novel loss function that models the inherent alignment of cells in the cell detection network; and a graph-based problem formulation to build associations between the detected cells.
The main contributions:
Our main contributions can be summarised as follows:
– We demonstrate how the top-down (cell detection) and bottom-up (structure recognition) cues can be combined visually to recognize table structures in document images.
– We present an end-to-end trainable network, termed TabStruct-Net, for training cell detection and structure recognition networks in a joint manner.
– We formulate a novel loss function (i.e., alignment loss) to incorporate structural constraints between every pair of table cells, and modify the Feature Pyramid Network (FPN) to capture better low-level and long-range features for cell detection.
– We enhance the visual feature representation for structure recognition (built on top of model [9]) through LSTMs.
– We unify results from previously published methods on table structure recognition for a thorough comparison study.
“Our solution for table structure recognition progresses in three steps — (a) detection of table cells; (b) establishing row/column relationships between the detected cells, and (c) a post-processing step to produce the XML output as desired. The above image depicts the block diagram of our approach.”
Top-Down: Cell Detection
For cell detection and localization, the authors use the object detection algorithm Mask R-CNN with additional enhancements:
(a) The Feature Pyramid Network (FPN) from (2, 1) as implemented in Matterport's Mask_RCNN repository, which was modified to build the FPN present in TabStructNet. We note that the computational graphs for P2, P3, and P4 are similar. While the tensor placeholders P{N}^TD do not even exist in Matterport's Mask_RCNN, they have been marked in gray above at equivalent places to simplify the comparison between the two FPN architectures. (b) The FPN from TabStructNet (3) includes both the 'bottom-up' and 'top-down' pathways. Notice that, while the P3 and P4 computation graphs are similar (i.e., a summation of 3 inputs, followed by a 2-D convolution), the P2 and P5 computation graphs are both different, featuring a summation over only two inputs. This, of course, is not a criticism of the architecture; we only note the perceived contradictions concerning what Figure 5 from (3) leads one to believe. (c) Redrawn Figure 5 from (3) for easy comparison.
The key difference between the FPN from the Matterport Mask R-CNN and that from the TabStructNet pretrained model is the newly introduced bottom-up pathway in the FPN of TabStructNet. Notice from Figure 2(b) that the graph structures in the top-down and bottom-up pathways are different for the {C2, C3, C4, C5} to {P2, P3, P4, P5} computations. Specifically, for N = {3, 4}, the P{N} tensors are the result of a 2-D convolution over a summation of three tensors: Conv2D(C{N}), P{N}^TD and P{N}^BU. However, for N = {2, 5}, the P{N} tensors are the result of a 2-D convolution over a summation of only two tensors each. Formally:

P2 = Conv2D(Conv2D(C2) (:= P2^BU) + P2^TD)
P3 = Conv2D(Conv2D(C3) + P3^TD + P3^BU)
P4 = Conv2D(Conv2D(C4) + P4^TD + P4^BU)
P5 = Conv2D(Conv2D(C5) (:= P5^TD) + P5^BU)

where

P2^BU = Conv2D(C2), P5^TD = Conv2D(C5),
P3^BU = MaxPool2D(Conv2D(P2^BU)), P4^TD = UpSample2D(P5^TD),
P4^BU = MaxPool2D(Conv2D(P3^BU)), P3^TD = UpSample2D(P4^TD),
P5^BU = MaxPool2D(Conv2D(P4^BU)), P2^TD = UpSample2D(P3^TD).
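To make this merge pattern concrete, here is a minimal tf.keras sketch of the combined top-down plus bottom-up FPN described above. The channel count, kernel sizes, and function name are assumptions for illustration, not the authors' exact configuration.

```python
# A minimal sketch of the top-down + bottom-up FPN merge described above.
# Assumptions (not from the repository): 256 output channels, 1x1 lateral
# convolutions, 3x3 smoothing convolutions, and backbone levels C2..C5 whose
# spatial sizes differ by factors of 2.
import tensorflow as tf
from tensorflow.keras import layers

def fpn_top_down_bottom_up(C2, C3, C4, C5, channels=256):
    conv1 = lambda: layers.Conv2D(channels, 1, padding="same")  # lateral 1x1
    conv3 = lambda: layers.Conv2D(channels, 3, padding="same")  # smoothing 3x3

    # Top-down pathway: start at C5 and upsample towards the finer levels.
    P5_td = conv1()(C5)
    P4_td = layers.UpSampling2D(2)(P5_td)
    P3_td = layers.UpSampling2D(2)(P4_td)
    P2_td = layers.UpSampling2D(2)(P3_td)

    # Bottom-up pathway: start at C2 and downsample towards the coarser levels.
    P2_bu = conv1()(C2)
    P3_bu = layers.MaxPooling2D(2)(conv3()(P2_bu))
    P4_bu = layers.MaxPooling2D(2)(conv3()(P3_bu))
    P5_bu = layers.MaxPooling2D(2)(conv3()(P4_bu))

    # Merge: P3 and P4 sum three tensors, P2 and P5 only two (as noted above).
    P2 = conv3()(layers.Add()([P2_bu, P2_td]))
    P3 = conv3()(layers.Add()([conv1()(C3), P3_td, P3_bu]))
    P4 = conv3()(layers.Add()([conv1()(C4), P4_td, P4_bu]))
    P5 = conv3()(layers.Add()([P5_td, P5_bu]))
    return P2, P3, P4, P5
```

With C2..C5 coming from a ResNet backbone at strides 4, 8, 16 and 32, the summed tensors at each level line up spatially, which is why the upsample/maxpool factors of 2 work out.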
Here is a quote from the paper itself:
— (a) we augment the Region Proposal Network (RPN) with dilated convolutions [48, 49] to better capture long-range row and column visual features of the table. This improves detection of multi-row/column spanning and multi-line cells; (b) inspired by [50], we append the feature pyramid network with a top-down pathway, which propagates high-level semantic information to low-level feature maps. This allows the network to work better for cells with varying scales; and (c) we append additional losses during the training phase in order to model the inherent structural constraints
We formulate two ways of incorporating this information — (i) through an end-to-end training of cell detection and the structure recognition networks (explained next), and (ii) through a novel alignment loss function. For the latter, we make use of the fact that every pair of cells is aligned horizontally if they span the same row and aligned vertically if they span the same column. For the ground truth, where tight bounding boxes around the cells’ content are provided [18, 14, 13], we employ an additional ground truth pre-processing step to ensure that bounding boxes of cells in the same row and same column are aligned vertically and horizontally, respectively. We model these constraints during the training in the following manner:
Here, SR, SC, ER and EC represent the starting row, starting column, ending row and ending column indices as shown in Figure 4. Also, ci and cj denote two cells in a particular row r or column c; x1_ci, y1_ci, x2_ci and y2_ci represent the bounding box coordinates X-start, Y-start, X-end and Y-end, respectively, of cell ci. These losses (L1, L2, L3, L4) can be interpreted as constraints that enforce proper alignment of cells beginning from the same row, ending on the same row, beginning from the same column and ending on the same column, respectively.
Alignment loss is defined as
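As a rough illustration of what the four penalties L1-L4 described above compute (not the paper's exact Eq. 1; the norm and normalisation may differ), here is a minimal numpy sketch:

```python
# A rough numpy sketch of the alignment penalty described above. This is an
# illustration, not the paper's exact Eq. 1.
import numpy as np

def alignment_loss(boxes, same_sr, same_er, same_sc, same_ec):
    """boxes: (N, 4) array of (x1, y1, x2, y2) per predicted cell.
    same_sr / same_er / same_sc / same_ec: lists of index groups, each group
    holding the cells that share a starting row / ending row / starting
    column / ending column, respectively."""
    def pairwise_penalty(groups, coord):
        loss = 0.0
        for group in groups:
            vals = boxes[np.asarray(group), coord]
            # Squared difference of the coordinate over all pairs in the group.
            loss += np.sum((vals[:, None] - vals[None, :]) ** 2) / 2.0
        return loss

    L1 = pairwise_penalty(same_sr, 1)  # same starting row    -> align y1
    L2 = pairwise_penalty(same_er, 3)  # same ending row      -> align y2
    L3 = pairwise_penalty(same_sc, 0)  # same starting column -> align x1
    L4 = pairwise_penalty(same_ec, 2)  # same ending column   -> align x2
    return L1 + L2 + L3 + L4
```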
Output by the Cell detection network
The cell detection network is a modified Mask R-CNN, so its output is a mask, the detected cell coordinates, and their respective probabilities.
tablecell 0.9990501 7 55 434 108
tablecell 0.9985606 0 375 416 423
tablecell 0.9984049 0 212 432 260
tablecell 0.998395 434 458 676 505
tablecell 0.9983854 3 457 416 507
Cell coordinates and their respective objectness scores.
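This raw output is easy to parse. Below is a small helper, not part of the repository, that turns those lines into labels, scores, and boxes; the coordinate ordering is an assumption and may differ from what the repository emits.

```python
# A small helper to parse the detection output shown above. The coordinate
# order (x1, y1, x2, y2) is an assumption.
def parse_detections(lines):
    cells = []
    for line in lines:
        label, score, x1, y1, x2, y2 = line.split()
        cells.append({
            "label": label,
            "score": float(score),
            "box": (int(x1), int(y1), int(x2), int(y2)),
        })
    return cells

# Example usage with two of the lines above:
detections = parse_detections([
    "tablecell 0.9990501 7 55 434 108",
    "tablecell 0.9985606 0 375 416 423",
])
```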
*NOTE: Notice how the “Total” cell in the first column has not been detected.
Bottom-Up: Structure Recognition
The bottom-up network uses the feature maps from P2 and the cell coordinate predictions from the top-down (cell detection) network to establish relationships between the predicted cells.
We formulate table structure recognition using graphs, similar to [9]. We consider each cell of the table as a vertex and construct two adjacency matrices — a row matrix Mrow and a column matrix Mcol — which describe the association between cells with respect to rows and columns. Mrow, Mcol ∈ R^(Ncells×Ncells); Mrow[i,j] = 1 or Mcol[i,j] = 1 if cells i, j belong to the same row or column, else 0. The structure recognition network aims to predict row and column relationships between the cells predicted by the cell detection module during training and testing. During training, only those predicted table cells are used for structure recognition which overlap with the ground truth table cells with an IoU greater than or equal to 0.5. This network has three components:

— Visual Component: We use visual features from the P2 layer (refer Figure 5) of the feature pyramid based on linear interpolation of the cell bounding boxes predicted by the cell detection module. In order to encode cells' visual characteristics across their entire height and width, we pass the gathered P2 features for every cell along their centre horizontal and centre vertical lines through an LSTM [51] to obtain the final visual features (refer Figure 5) (as opposed to visual features corresponding to cells' centroids only, as in [52]).

— Interaction Component: We use the DGCNN architecture based on graph neural networks used in [52] to model the interaction between geometrically neighboring detected cells. Its output, termed interaction features, is a fixed-dimensional vector for every cell that aggregates information from its neighbouring table cells.

— Classification Component: For a pair of table cells, the interaction features are concatenated and appended with the difference between the cells' bounding box coordinates. This is fed as input to the row/column classifiers to predict row/column associations. Please note that we use the same Monte Carlo based sampling as [52] to ensure efficient training and class balancing. During testing, however, predictions are made for every unique pair of table cells.
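To make the adjacency matrices tangible, here is a minimal sketch that builds Mrow and Mcol from per-cell row/column spans, marking two cells as adjacent when their spans overlap. This is only an illustration of the data structure; in TabStruct-Net these matrices are predicted by the network, not built from ground-truth spans.

```python
# A minimal sketch of the row/column adjacency matrices described above,
# built from per-cell spanning indices (SR, ER, SC, EC). Not the authors'
# code; TabStruct-Net predicts these matrices.
import numpy as np

def build_adjacency(cells):
    """cells: list of dicts with integer keys 'sr', 'er', 'sc', 'ec'."""
    n = len(cells)
    M_row = np.zeros((n, n), dtype=np.int32)
    M_col = np.zeros((n, n), dtype=np.int32)
    for i in range(n):
        for j in range(n):
            # Cells i and j share a row if their row spans overlap.
            if cells[i]["sr"] <= cells[j]["er"] and cells[j]["sr"] <= cells[i]["er"]:
                M_row[i, j] = 1
            # Cells i and j share a column if their column spans overlap.
            if cells[i]["sc"] <= cells[j]["ec"] and cells[j]["sc"] <= cells[i]["ec"]:
                M_col[i, j] = 1
    return M_row, M_col
```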
We train the cell detection and structure recognition networks jointly (termed TabStruct-Net) to collectively predict cell bounding boxes along with row and column adjacency matrices. Further, the two structure recognition pathways for row and column adjacency matrices are put together in parallel. The visual features prepared using LSTMs for every vertex are duplicated for both pathways, after which they work in a parallel manner. The overall empirical loss of TabStruct-Net is given by:

L = Lbox + Lcls + Lmask + Lalign + Lgnn (2)

where Lbox, Lcls and Lmask are the bounding box regression loss, classification loss and mask loss, respectively, as defined in Mask R-CNN [47]; Lalign is the alignment loss, modeled as a regularizer (defined in Eq. 1); and Lgnn is the cross-entropy loss back-propagated from the structure recognition module of TabStruct-Net. The additional loss components help the model better align cells belonging to the same rows/columns during training and, in a sense, fine-tune the predicted bounding boxes, which makes post-processing and structure recognition easier in the subsequent step.
Output by the Structure Recognition network
The structure recognition network outputs two adjacency matrices, one corresponding to rows and one to columns.
Here is how it looks:
There are 30 detected cells and their respective column relations (the same goes for rows).
Post Processing
After getting the output from the structure recognition network, we create an XML representation from the row and column adjacency matrices.
From the cell coordinates, along with the row and column adjacency matrices, SR, SC, ER and EC indices are assigned to each cell, which indicate the spanning of that cell along rows and columns.
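As a simplified illustration of this index assignment (not the repository's actual post-processing code, and ignoring multi-row spanning cells for brevity), start-row indices can be derived by grouping mutually row-adjacent cells and ordering the groups from top to bottom:

```python
# A simplified sketch of assigning SR (start-row) indices from the row
# adjacency matrix: group mutually row-adjacent cells and order the groups
# top-to-bottom by y-coordinate. Multi-row spanning cells are ignored here;
# the column case (SC) is symmetric using M_col and x-coordinates.
import numpy as np

def assign_start_rows(boxes, M_row):
    """boxes: (N, 4) array of (x1, y1, x2, y2); M_row: (N, N) 0/1 matrix."""
    n = len(boxes)
    unassigned, groups = set(range(n)), []
    while unassigned:
        seed = min(unassigned, key=lambda i: boxes[i][1])        # topmost cell
        group = {j for j in unassigned if M_row[seed, j] == 1} | {seed}
        groups.append(group)
        unassigned -= group
    start_row = np.zeros(n, dtype=int)
    for row_index, group in enumerate(groups):                   # top-to-bottom
        for cell in group:
            start_row[cell] = row_index
    return start_row
```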
Output by a post-processing step
Here is what the output of the post-processing step looks like:
Notice how in the tablecell tag we have end_col, end_row, start_col, and start_row sub-tags providing information about the cell's row and column spans.
We use Tesseract [53] to extract the content of every predicted cell. The XML output for every table image finally contains the coordinates of the predicted cell bounding boxes along with the cell spanning information and its content.
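A minimal sketch of this last step could look like the following. The tablecell/start_row/end_row/start_col/end_col tag names follow the output shown above; the content tag, the helper's interface, and the use of pytesseract are assumptions for illustration, not the repository's exact schema.

```python
# A minimal sketch of writing one predicted cell to the XML output and filling
# its text via OCR. The span tag names follow the output shown above; the
# "content" tag and this helper's interface are assumptions.
import xml.etree.ElementTree as ET
import pytesseract

def cell_to_xml(parent, table_image, box, span):
    """table_image: a PIL.Image of the table;
    box = (x1, y1, x2, y2); span = (start_row, end_row, start_col, end_col)."""
    cell = ET.SubElement(parent, "tablecell")
    ET.SubElement(cell, "start_row").text = str(span[0])
    ET.SubElement(cell, "end_row").text = str(span[1])
    ET.SubElement(cell, "start_col").text = str(span[2])
    ET.SubElement(cell, "end_col").text = str(span[3])
    # OCR the cell crop to recover its textual content.
    crop = table_image.crop(box)
    ET.SubElement(cell, "content").text = pytesseract.image_to_string(crop).strip()
    return cell
```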
Evaluation and experimentation
Please read the Evaluation and Results on Table Structure Recognition section of the original paper.
Table2SText (An Android Application)
Table image to structured text (Table2SText) is an adaptation and deployment of TabStructNet as an Android app. The original repo requires a lot of manual work:
- Download the pre-trained model for evaluation.
- Execute the command to start the evaluation.
- Manually move the output files produced by TabStructNet to the directory where the output is post-processed into the XML representation.
- Execute the command to generate the XML file.
- And finally, process the XML file to generate a CSV or XLSX file which contains the actual text (thanks to Tesseract) in a structured way.
These steps are really annoying when you just want to get the formatted text out of a table image.
All of them have been aggregated into a single step in my fork of TabStructNet.
The workflow of the app is depicted by the below flow diagram:
Flowchart of Table2SText
The above figure shows how Table2SText works under the hood.
- Users first have to register using a mobile number.
- Take a picture using a camera or use an existing picture of a table document.
- The picture gets uploaded to cloud storage.
- After a successful upload, the app inserts metadata into the Realtime Database (RTDB).
- Insertion into the RTDB triggers a Cloud Function, which pre-processes the data, makes the prediction, and post-processes the output. We use Google Vision here for image-to-text rather than plain old Tesseract.
- The Cloud Function sends a notification to the end device to notify it about the result.
- Table2SText downloads the data, runs further post-processing, and presents the result to the user.
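For reference, a stripped-down sketch of the Cloud Function step in the flow above might look like this. The metadata keys (imagePath, deviceToken), the trigger path, and the result naming are hypothetical; only the firebase_admin calls are real, and the TabStructNet inference itself is omitted.

```python
# A stripped-down sketch of the Cloud Function in the flow above (a 1st-gen
# Python background function triggered by an RTDB write). Metadata keys and
# paths are hypothetical; the actual inference/post-processing is omitted.
from firebase_admin import initialize_app, messaging, storage

initialize_app()  # assumes the project's default storage bucket is configured

def on_table_uploaded(data, context):
    meta = data["delta"]                        # metadata written by the app
    bucket = storage.bucket()
    bucket.blob(meta["imagePath"]).download_to_filename("/tmp/table.png")

    # ... run TabStructNet inference + post-processing on /tmp/table.png ...
    result_path = meta["imagePath"] + ".xml"    # hypothetical result location

    # Notify the uploading device that its result is ready.
    messaging.send(messaging.Message(
        token=meta["deviceToken"],
        data={"resultPath": result_path},
    ))
```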
*NOTE: Notice how the image has vertical lines drawn with a marker; this is a way to provide table structure information to the TabStructNet architecture. We could provide this information virtually, by drawing the lines in the app itself, for much more accurate table structure recognition.
The main caveat here is that it takes a long time for TabStructNet to predict the output. First, due to the cold start problem, it takes around 2 minutes to get the first prediction.
A cold start happens when you execute an inactive function. The delay comes from your cloud provider provisioning your selected runtime container and then running your function. In a nutshell, this process will considerably increase your execution time.
Once we have an active container (the runtime environment in which our code runs), it takes around 1 minute to get the result. That is still a huge amount of time for such a simple task.
Future work
- Many times a camera might not capture the whole document because the document is too long. To cope with this, we could use video capture to record the whole document and process it piece by piece (table videos to structured text).
- Deploying the model on the user's end device, so that user data never needs to leave the device. This is an enormously complex task because the architecture includes Mask R-CNN, LSTM, and DGCNN components plus a lot of pre-processing and post-processing, and converting the model to a Lite version using TF-Lite is really difficult (see the sketch after this list).
- Using federated learning to make our algorithm smarter by training the model on end devices and sending the weights (not the data) to a server, which averages the weights from all participating end devices to update the machine learning model's weights. You can refer to this blog to understand federated learning better.
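For context on the second point above, the standard TF-Lite conversion entry point is shown below; the saved-model path is hypothetical, and in practice the custom ops in the Mask R-CNN / LSTM / DGCNN graph are exactly what make this step hard.

```python
# The standard TF-Lite conversion entry point (the saved-model path below is
# hypothetical). Graphs with ops unsupported by the TF-Lite builtins, as is
# typical for Mask R-CNN-style models, need the SELECT_TF_OPS fallback, and
# some ops may still fail to convert.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/tabstructnet")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # native TF-Lite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to full TensorFlow ops
]
tflite_model = converter.convert()

with open("tabstructnet.tflite", "wb") as f:
    f.write(tflite_model)
```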
Hope you have found this blog useful. If you have any questions or concerns feel free to post them in the comment section below or connect with me on LinkedIn.
THANK YOU!!
References
Getting the Intuition of Graph Neural Networks | by Inneke Mayachita | Analytics Vidhya | Medium
Federated Learning: A Step by Step Implementation in Tensorflow