How to use "Amazon Textract" (AI document analysis service) in Python for beginners: AWS Cheat Sheet

27/01/2023

"AWS Cheat Sheet" is a series that briefly introduces handy tricks for making use of "Amazon Web Services" (AWS). This time, we use AWS's AI document analysis service "Amazon Textract" from Python. We introduce the methods Amazon Textract provides and show how to use some of them.

AWS offers many "AI services" that make pre-trained AI (artificial intelligence) easy to use, covering a wide range of areas such as computer vision, language, recommendation, and forecasting.

Amazon Textract, introduced this time, is a service that recognizes printed and handwritten text in scanned documents and detects structured information such as tables and forms without templates. This lets you extract data faster and more flexibly than manual data entry or simple optical character recognition (OCR) software.

Amazon Textract

On the important matter of language support, Textract does not currently support Japanese. It supports English, Spanish, Italian, French, Portuguese, and German, and handwriting is supported in English only. It is a pity that Japanese is not supported at this time, so please read this article either with future updates in mind or assuming use with one of the currently supported languages.

AWS's AI services can be used from the console, but even if you have no development work in mind, once you get used to it you will find it more convenient to call the API as we do here. I hope this article serves as a starting point.

Textract is billed on a pay-as-you-go basis, with a different fee for each of four API operations depending on the use (Textract is not offered in the Tokyo or Osaka Regions, so the prices below are those of the Northern Virginia Region, given for reference).

1. The "Detect Document Text API" recognizes printed and handwritten text. It costs $0.0015 per page (US dollars, likewise below) up to 1 million pages, and $0.0006 per page beyond 1 million pages.

2. The "Analyze Document API" for tables detects table structure in addition to recognizing printed and handwritten text. It costs $0.015 per page up to 1 million pages, and $0.01 per page beyond 1 million pages.

3. The "Analyze Document API" for forms detects the structure of "key" and "value" pairs in a form (for example, a key and its corresponding value such as "Yamada") in addition to recognizing printed and handwritten text. It costs $0.05 per page up to 1 million pages, and $0.04 per page beyond 1 million pages.

4. The "Analyze Expense API" detects corresponding key and value pairs in invoices and receipts (currently English only). It costs $0.01 per page up to 1 million pages, and $0.008 per page beyond 1 million pages.
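The tiered rates above can be turned into a rough cost estimate. The following is a minimal sketch, not an official calculator: the function name, the dictionary keys, and the assumption that the 1 million page threshold is counted per month are ours, and the free tier is ignored.

```python
# Per-page rates in USD copied from the list above:
# (rate up to 1 million pages, rate beyond 1 million pages).
# The dictionary keys are hypothetical labels, not API names.
RATES = {
    "detect_document_text": (0.0015, 0.0006),
    "analyze_document_tables": (0.015, 0.01),
    "analyze_document_forms": (0.05, 0.04),
    "analyze_expense": (0.01, 0.008),
}
TIER_LIMIT = 1_000_000  # pages billed at the first-tier rate (assumed per month)


def estimate_monthly_cost(operation: str, pages: int) -> float:
    """Return the estimated charge in USD for `pages` pages in one month."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    first_rate, later_rate = RATES[operation]
    first_tier_pages = min(pages, TIER_LIMIT)
    later_pages = max(pages - TIER_LIMIT, 0)
    return first_tier_pages * first_rate + later_pages * later_rate


# 2,000 pages of plain text recognition
print(round(estimate_monthly_cost("detect_document_text", 2_000), 2))
```

Actual charges depend on the Region and on current AWS pricing, so treat the result only as a ballpark figure.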

Textract is covered by the free tier: for three months from the first request, you can analyze up to 1,000 pages per month with the Detect Document Text API, and up to 100 pages per month with the Analyze Document API or Analyze Expense API, free of charge (usage beyond the free tier is billed).

Note that a "page" above means one image in the case of PNG or JPEG. Even if a single scanned image contains multiple pages, Textract counts it as one page. In the case of PDF and TIFF, each page of the document is counted as a processed page.

This article assumes that the reader's environment satisfies the following requirements.

It is not required, but the sample code below is assumed to be run on "Jupyter Notebook".

Textract has the following methods:


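Method name	Function	Arguments	Return value
detect_document_text	Recognizes text in the input document	Document data	Dictionary
analyze_document	Analyzes the relationships between items detected in the input document	Document data, FeatureTypes values, HumanLoop configuration	Dictionary
analyze_expense	Analyzes invoices and receipts	Document data	Dictionary
start_document_text_detection	Starts an asynchronous job that recognizes text in the input document	Document data, ClientRequest token, JobTag, notification channel settings, output configuration, KMS key ID	Dictionary
get_document_text_detection	Gets the results of a job started with start_document_text_detection	JobID, NextToken, maximum number of results	Dictionary
start_document_analysis	Starts an asynchronous job that analyzes the relationships between items detected in the input document	Document data, FeatureTypes values, ClientRequest token, JobTag, notification channel settings, output configuration, KMS key ID	Dictionary
get_document_analysis	Gets the results of a job started with start_document_analysis	JobID, NextToken, maximum number of results	Dictionary
start_expense_analysis	Starts an asynchronous job that analyzes invoices and receipts	Document data, ClientRequest token, JobTag, notification channel settings, output configuration, KMS key ID	Dictionary
get_expense_analysis	Gets the results of a job started with start_expense_analysis	JobID, NextToken, maximum number of results	Dictionary
can_paginate	Checks whether a given method supports pagination	Method name	Boolean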
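The start/get pairs in the table work together: a start method returns a job ID, and the matching get method is called repeatedly until the job finishes, following NextToken to collect all result pages. The following is a minimal sketch of that flow for text detection. The function name and the `poll_seconds` and `max_polls` parameters are our own illustrative choices, and the bucket and file names in the comments are placeholders.

```python
import time


def collect_text_detection_blocks(client, job_id, poll_seconds=5.0, max_polls=60):
    """Poll a text-detection job until it finishes, then gather every Block."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"job {job_id} is still in progress")
    if status != "SUCCEEDED":
        # This sketch treats FAILED and PARTIAL_SUCCESS alike as errors.
        raise RuntimeError(f"job {job_id} ended with status {status}")

    blocks = list(response.get("Blocks", []))
    # Results are paged: keep following NextToken until it is absent.
    while "NextToken" in response:
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


# Usage sketch (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# client = boto3.client("textract", region_name="us-east-1")
# job = client.start_document_text_detection(
#     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}}
# )
# blocks = collect_text_detection_blocks(client, job["JobId"])
```

Polling is the simplest approach for a notebook. For production use, the notification channel setting listed in the table would avoid repeated calls.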

Let me explain a few points.

The file formats Textract supports are JPEG and PNG for synchronous processing, and PDF and TIFF in addition to these for asynchronous processing. With asynchronous processing, you can analyze multi-page PDF and TIFF documents.

In synchronous processing, there are two ways to supply the document: passing the document's byte data when the API is called (up to 5MB), or specifying an object stored in an Amazon S3 bucket (up to 5MB) (all of the examples in this article are run this way). In asynchronous processing, on the other hand, you specify an object stored in an S3 bucket. The S3 bucket must be created in the same Region as the API endpoint used to call Textract.
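The two ways of supplying a document in synchronous processing correspond to two shapes of the `Document` argument in boto3. The following is a minimal sketch: the helper names and the `us-east-1` default are our own choices, and treating 5MB as 5 × 1024 × 1024 bytes is an assumption.

```python
MAX_SYNC_BYTES = 5 * 1024 * 1024  # the 5MB limit quoted above (exact byte count assumed)


def document_from_bytes(data: bytes) -> dict:
    """Build the Document argument from raw image bytes."""
    if len(data) > MAX_SYNC_BYTES:
        raise ValueError("image exceeds the 5MB limit for synchronous calls")
    return {"Bytes": data}


def document_from_s3(bucket: str, key: str) -> dict:
    """Build the Document argument that points at an object in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def detect_text(document: dict, region_name: str = "us-east-1") -> dict:
    """Call detect_document_text; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("textract", region_name=region_name)
    return client.detect_document_text(Document=document)


# Usage sketch (file name is a placeholder):
# with open("letter.jpg", "rb") as f:
#     response = detect_text(document_from_bytes(f.read()))
```

When using `document_from_s3`, the `region_name` passed to the client should match the Region of the bucket, as noted above.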

When Textract detects elements such as pages, lines, and words in a document, all of that information is described in the return value in a common unit called a "Block". Each Block also carries its individually assigned ID, bounding box information, and information about its child elements. Conversely, to find out which kind of element a given Block in the return value represents, refer to the value of the BlockType key in that Block. The value is "PAGE" for a page, "LINE" for a line, and "WORD" for a word. A "WORD" is a child element of a "LINE", and a "LINE" is a child element of a "PAGE".

Here are some ways to use the Textract methods.

The detect_document_text method, introduced first, recognizes the text in the input document. Below, as a sample, we use a letter that President Obama sent in 2016 to Republican Senator Mark Kirk (titles as of that time).
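The parent and child structure of Blocks can be followed through each Block's `Relationships` entry, which lists child IDs under the type `CHILD`. The following is a minimal sketch that groups words under their lines. The function name is ours, and `sample` is a hand-written fragment in the shape of a response, not real Textract output.

```python
def words_by_line(response: dict) -> list:
    """Return [(line_text, [word_text, ...]), ...] in response order."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # Collect the IDs of this line's child blocks.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        words = [
            blocks_by_id[i]["Text"]
            for i in child_ids
            if blocks_by_id[i]["BlockType"] == "WORD"
        ]
        result.append((block["Text"], words))
    return result


# Hand-written fragment for illustration only (IDs are made up).
sample = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "THE WHITE HOUSE",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "THE"},
        {"Id": "w2", "BlockType": "WORD", "Text": "WHITE"},
        {"Id": "w3", "BlockType": "WORD", "Text": "HOUSE"},
    ]
}

print(words_by_line(sample))
# [('THE WHITE HOUSE', ['THE', 'WHITE', 'HOUSE'])]
```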

Source: Wikimedia Commons: https://commons.wikimedia.org/wiki/File:US_President_Obama_letter_to_Senator_Mark_Kirk_on_meeting_with_Merrick_Garland.jpg (Public domain)

The printed and handwritten text to be recognized is as follows.

The White House

Washington

Mark-

Thank you for fair and responsible

treatment of Merrick Garland. It upholds

the institutional values of the Senate, and helps

preserve the bipartisan ideals of an independent

judiciary.

Barack Obama

The letter thanks Senator Kirk for meeting with Merrick Garland (whom President Obama had nominated to the federal Supreme Court, but who was unable to take the seat because of the Republican Party).
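Once detect_document_text has returned, the recognized lines can be pulled out of the response by filtering on BlockType. The following is a minimal sketch: the function name and the `min_confidence` parameter are our own, and the response shown is hand-written with made-up confidence values, not the actual result for this letter.

```python
def extract_lines(response: dict, min_confidence: float = 0.0) -> list:
    """Return (text, confidence) for each LINE block at or above the threshold."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE" and block["Confidence"] >= min_confidence
    ]


# Hand-written response fragment for illustration (confidence values are invented).
fake_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "WASHINGTON", "Confidence": 99.0},
        {"BlockType": "LINE", "Text": "Mark-", "Confidence": 60.0},
        {"BlockType": "WORD", "Text": "Mark-", "Confidence": 60.0},
    ]
}

for text, confidence in extract_lines(fake_response, min_confidence=90.0):
    print(f"{confidence:5.1f}  {text}")
```

Filtering by confidence is one way to separate clearly printed text from handwriting that was recognized less reliably. Which threshold is suitable depends on the document.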
