Title: | Extracting Order Position Tables from PDF-Based Order Documents |
---|---|
Description: | Functions for extracting text and tables from PDF-based order documents. It provides an n-gram-based approach for identifying the language of an order document. It furthermore uses R-package 'pdftools' to extract the text from an order document. In the case that the PDF document is only including an image (because it is scanned document), R package 'tesseract' is used for OCR. Furthermore, the package provides functionality for identifying and extracting order position tables in order documents based on a clustering approach. |
Authors: | Michael Scholz [cre, aut], Joerg Bauer [aut] |
Maintainer: | Michael Scholz <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-12-13 05:25:05 UTC |
Source: | https://github.com/cran/orderanalyzer |
This packages provides functions for extracting text and order-position-tables from PDF-based order documents.
Package: | orderanalyzer |
Type: | Package |
Version: | 1.0.0 |
Date: | 2024-12-11 |
License: | GPL-3 |
Depends: | R (>= 4.3.0) |
Michael Scholz [email protected]
Joerg Bauer [email protected]
This function extracts order-position-tables from PDF-based order documents. It tries to identify table rows based on a clustering approach and thereafter identifies the column structure. A table row can consist of multiple text rows and the text rows can span different columns. This function furthermore tries to identify the meaning of the columns (position, articleID, description, quantity, quanity unit, unit price, total price, currency, date).
extractTables(text, minCols = 3, maxDistance = 20, entityNames = NA)
extractTables(text, minCols = 3, maxDistance = 20, entityNames = NA)
text |
List including several representations of text extracted from a PDF file. This list is generated by the function extractText. |
minCols |
Number of columns a table must minimal consist of |
maxDistance |
Number of text lines that can maximally exist between the start of two table rows |
entityNames |
A list of four name vectors (currencyUnits, quantityUnits, headerNames, noTableNames). Each vector contains strings that correspond to currency units, quantity units, header names or names of entities not being a table. |
List of lists describing the tables. Each sublist includes a data frame (data) which is the identified table, the position of text lines that constitute the table and the position of the significant lines.
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer") text <- extractText(file) # Extracting order tables without any further information tables <- extractTables(text) tables[[1]]$data # Extracting order tables with further information tables <- extractTables(text, entityNames = list(currencyUnits = enc2utf8(c("eur", "euro", "\u20AC")), quantityUnits = enc2utf8(c("pcs", "pcs.")), headerNames = enc2utf8(c("pos", "item", "quantity")), noTableNames = enc2utf8(c("order total", "supplier number"))) ) tables[[1]]$data # Extracting order tables from a German document file <- system.file("extdata", "OrderDocument_de.pdf", package = "orderanalyzer") text <- extractText(file) tables <- extractTables(text) tables[[1]]$data
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer") text <- extractText(file) # Extracting order tables without any further information tables <- extractTables(text) tables[[1]]$data # Extracting order tables with further information tables <- extractTables(text, entityNames = list(currencyUnits = enc2utf8(c("eur", "euro", "\u20AC")), quantityUnits = enc2utf8(c("pcs", "pcs.")), headerNames = enc2utf8(c("pos", "item", "quantity")), noTableNames = enc2utf8(c("order total", "supplier number"))) ) tables[[1]]$data # Extracting order tables from a German document file <- system.file("extdata", "OrderDocument_de.pdf", package = "orderanalyzer") text <- extractText(file) tables <- extractTables(text) tables[[1]]$data
This function extracts text from PDF documents and returns the text as a string, as a list of lines and as a list of words. It uses 'pdftools' to extract the content from textual PDF files and 'tesseract' to extract the content from image-based PDF-files.
extractText(file)
extractText(file)
file |
Path to the PDF file |
List including the extracted text, a data table including the lines, a data table including the words, the type and language of the document.
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer") text <- extractText(file) text$words
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer") text <- extractText(file) text$words
This function identifies the language of a given string based on the most frequent trigrams in different languages. Supported languages are Czech, Dutch, English, French, German, Spanish, Latvian and Lithuanian.
identifyLanguage(text)
identifyLanguage(text)
text |
String for which the language should be identified |
Name of the detected language.
text <- "The tea in the cup still is hot." language <- identifyLanguage(text) language
text <- "The tea in the cup still is hot." language <- identifyLanguage(text) language