#include <pageiterator.h>

Public Member Functions
	PageIterator (PAGE_RES *page_res, Tesseract *tesseract, int scale, int scaled_yres, int rect_left, int rect_top, int rect_width, int rect_height)

virtual	~PageIterator ()

	PageIterator (const PageIterator &src)

const PageIterator &	operator= (const PageIterator &src)

bool	PositionedAtSameWord (const PAGE_RES_IT *other) const

virtual void	Begin ()

virtual void	RestartParagraph ()

bool	IsWithinFirstTextlineOfParagraph () const

virtual void	RestartRow ()

virtual bool	Next (PageIteratorLevel level)

virtual bool	IsAtBeginningOf (PageIteratorLevel level) const

virtual bool	IsAtFinalElement (PageIteratorLevel level, PageIteratorLevel element) const

int	Cmp (const PageIterator &other) const

void	SetBoundingBoxComponents (bool include_upper_dots, bool include_lower_dots)

bool	BoundingBox (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const

bool	BoundingBox (PageIteratorLevel level, const int padding, int *left, int *top, int *right, int *bottom) const

bool	BoundingBoxInternal (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const

bool	Empty (PageIteratorLevel level) const

PolyBlockType	BlockType () const

Pta *	BlockPolygon () const

Pix *	GetBinaryImage (PageIteratorLevel level) const

Pix *	GetImage (PageIteratorLevel level, int padding, Pix *original_img, int *left, int *top) const

bool	Baseline (PageIteratorLevel level, int *x1, int *y1, int *x2, int *y2) const

void	Orientation (tesseract::Orientation *orientation, tesseract::WritingDirection *writing_direction, tesseract::TextlineOrder *textline_order, float *deskew_angle) const

void	ParagraphInfo (tesseract::ParagraphJustification *justification, bool *is_list_item, bool *is_crown, int *first_line_indent) const

bool	SetWordBlamerBundle (BlamerBundle *blamer_bundle)

Protected Member Functions
TESS_LOCAL void	BeginWord (int offset)

Protected Attributes
PAGE_RES *	page_res_

Tesseract *	tesseract_

PAGE_RES_IT *	it_

WERD *	word_

int	word_length_

int	blob_index_

C_BLOB_IT *	cblob_it_

bool	include_upper_dots_

bool	include_lower_dots_

int	scale_

int	scaled_yres_

int	rect_left_

int	rect_top_

int	rect_width_

int	rect_height_

Detailed Description

Class to iterate over tesseract page structure, providing access to all levels of the page hierarchy, without including any tesseract headers or having to handle any tesseract structures.

WARNING! This class points to data held within the TessBaseAPI class, and therefore can only be used while the TessBaseAPI class still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

See apitypes.h for the definition of PageIteratorLevel. See also ResultIterator, derived from PageIterator, which adds the ability to access OCR output with text-specific methods.

Constructor & Destructor Documentation

◆ PageIterator() [1/2]

tesseract::PageIterator::PageIterator	(	PAGE_RES *	page_res,
		Tesseract *	tesseract,
		int	scale,
		int	scaled_yres,
		int	rect_left,
		int	rect_top,
		int	rect_width,
		int	rect_height
	)

page_res and tesseract come directly from the BaseAPI. The rectangle parameters are copied indirectly from the Thresholder, via the BaseAPI. They represent the coordinates of some rectangle in an original image (in top-left-origin coordinates) and therefore the top-left needs to be added to any output boxes in order to specify coordinates in the original image. See TessBaseAPI::SetRectangle. The scale and scaled_yres are in case the Thresholder scaled the image rectangle prior to thresholding. Any coordinates in tesseract's image must be divided by scale before adding (rect_left, rect_top). The scaled_yres indicates the effective resolution of the binary image that tesseract has been given by the Thresholder. After the constructor, Begin has already been called.
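The coordinate rule described above (divide by scale, then add the rectangle's top-left corner) can be sketched as follows. This is a minimal illustration, not the library's implementation: RectParams, Point and ToOriginalImage are hypothetical names, clamping the result to the rectangle is an extra assumption, and any difference in axis orientation between the two images is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: the scale and image rectangle that a PageIterator
// is constructed with (field names mirror the constructor parameters).
struct RectParams {
  int scale;
  int rect_left;
  int rect_top;
  int rect_width;
  int rect_height;
};

// Hypothetical helper type: a point in original-image coordinates.
struct Point {
  int x;
  int y;
};

// Maps a point from the (possibly scaled) working image to the original image:
// divide by scale, then add (rect_left, rect_top).
// Assumption of this sketch: the result is clamped to the rectangle.
Point ToOriginalImage(const RectParams& p, int x, int y) {
  assert(p.scale > 0);
  int ox = x / p.scale + p.rect_left;
  int oy = y / p.scale + p.rect_top;
  ox = std::max(p.rect_left, std::min(ox, p.rect_left + p.rect_width));
  oy = std::max(p.rect_top, std::min(oy, p.rect_top + p.rect_height));
  return {ox, oy};
}
```

With scale 2 and a rectangle whose top-left corner is (100, 50), the working-image point (40, 20) maps to (120, 60) in the original image.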

◆ ~PageIterator()

tesseract::PageIterator::~PageIterator ( )

virtual

◆ PageIterator() [2/2]

tesseract::PageIterator::PageIterator ( const PageIterator & src )

Page/ResultIterators may be copied! This makes it possible to iterate over all the objects at a lower level, while maintaining an iterator to objects at a higher level. These constructors DO NOT CALL Begin, so iterations will continue from the location of src.

Member Function Documentation

◆ Baseline()

bool tesseract::PageIterator::Baseline	(	PageIteratorLevel	level,
		int *	x1,
		int *	y1,
		int *	x2,
		int *	y2
	)		const

Returns the baseline of the current object at the given level. The baseline is the line that passes through (x1, y1) and (x2, y2). WARNING: with vertical text, baselines may be vertical! Returns false if there is no baseline at the current position.
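Because baselines may be vertical, computing a slope as dy/dx from the two returned points can divide by zero. A minimal sketch of a safer approach, assuming the caller already has (x1, y1) and (x2, y2) from Baseline; BaselineAngle and IsMostlyVertical are hypothetical helper names, not part of the class:

```cpp
#include <cmath>
#include <cstdlib>

// Angle of the line through (x1, y1) and (x2, y2), in radians, in image
// coordinates. std::atan2 handles a vertical baseline (x1 == x2) without
// any division.
double BaselineAngle(int x1, int y1, int x2, int y2) {
  return std::atan2(static_cast<double>(y2 - y1),
                    static_cast<double>(x2 - x1));
}

// True if the baseline is closer to vertical than to horizontal.
bool IsMostlyVertical(int x1, int y1, int x2, int y2) {
  return std::abs(y2 - y1) > std::abs(x2 - x1);
}
```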

◆ Begin()

void tesseract::PageIterator::Begin ( )

virtual

Moves the iterator to the start of the page, resetting it to begin a new iteration.

Reimplemented in tesseract::ResultIterator.

◆ BeginWord()

void tesseract::PageIterator::BeginWord ( int offset )

protected

Sets up the internal data for iterating the blobs of a new word, then moves the iterator to the given offset.

◆ BlockPolygon()

Pta * tesseract::PageIterator::BlockPolygon ( ) const

Returns the polygon outline of the current block. The returned Pta must be ptaDestroy-ed after use. Note that the returned Pta lists the vertices of the polygon, and the last edge is the line segment between the last point and the first point. nullptr will be returned if the iterator is at the end of the document or layout analysis was not used.
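Since the vertex list does not repeat the first point, any computation over the polygon's edges has to wrap from the last vertex back to the first. A minimal sketch using the shoelace formula, assuming the vertices have already been copied out of the Pta into a plain vector; Vertex and PolygonArea are hypothetical names:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical plain copy of one polygon vertex.
struct Vertex {
  double x;
  double y;
};

// Absolute area of a polygon given as an open vertex list (shoelace formula).
// The index (i + 1) % n supplies the closing edge from the last vertex back
// to the first.
double PolygonArea(const std::vector<Vertex>& v) {
  const std::size_t n = v.size();
  if (n < 3) return 0.0;
  double twice_area = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const Vertex& a = v[i];
    const Vertex& b = v[(i + 1) % n];
    twice_area += a.x * b.y - b.x * a.y;
  }
  return (twice_area < 0 ? -twice_area : twice_area) / 2.0;
}
```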

◆ BlockType()

PolyBlockType tesseract::PageIterator::BlockType ( ) const

Returns the type of the current block. See apitypes.h for PolyBlockType.

◆ BoundingBox() [1/2]

bool tesseract::PageIterator::BoundingBox	(	PageIteratorLevel	level,
		int *	left,
		int *	top,
		int *	right,
		int *	bottom
	)		const

Returns the bounding rectangle of the current object at the given level, in coordinates of the original image. See comment on coordinate system above. Returns false if there is no such object at the current position. The returned bounding box is guaranteed to match the size and position of the image returned by GetBinaryImage, but may clip foreground pixels from a grey image. The padding argument to GetImage can be used to expand the image to include more foreground pixels. See GetImage below.

◆ BoundingBox() [2/2]

bool tesseract::PageIterator::BoundingBox	(	PageIteratorLevel	level,
		const int	padding,
		int *	left,
		int *	top,
		int *	right,
		int *	bottom
	)		const

◆ BoundingBoxInternal()

bool tesseract::PageIterator::BoundingBoxInternal	(	PageIteratorLevel	level,
		int *	left,
		int *	top,
		int *	right,
		int *	bottom
	)		const

Returns the bounding rectangle of the current object at the given level in the coordinates of the working image, that is, pix_binary(). This coordinate system has its origin at (rect_left_, rect_top_) with respect to the original image and is scaled by a factor of scale_. See comment on coordinate system above. Returns false if there is no such object at the current position.

◆ Cmp()

int tesseract::PageIterator::Cmp ( const PageIterator & other ) const

Returns whether this iterator is positioned before, at, or after other:

    before other: -1
    equal to other: 0
    after other: 1
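A three-way result of this kind adapts directly to an ordering predicate. The sketch below is illustrative only: Cursor is a hypothetical stand-in that identifies a position by a single reading-order index, not the real iterator.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for an iterator position.
struct Cursor {
  int index;

  // Same convention as Cmp: -1 if before other, 0 if equal, 1 if after.
  int Cmp(const Cursor& other) const {
    if (index < other.index) return -1;
    if (index > other.index) return 1;
    return 0;
  }
};

// Sorts cursors into reading order by turning the three-way result into the
// strict "less than" predicate that std::sort expects.
void SortByPosition(std::vector<Cursor>& cursors) {
  std::sort(cursors.begin(), cursors.end(),
            [](const Cursor& a, const Cursor& b) { return a.Cmp(b) < 0; });
}
```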

◆ Empty()

bool tesseract::PageIterator::Empty ( PageIteratorLevel level ) const

Returns true if there is no object at the given level.

◆ GetBinaryImage()

Pix * tesseract::PageIterator::GetBinaryImage ( PageIteratorLevel level ) const

Returns a binary image of the current object at the given level. The position and size match the return from BoundingBoxInternal, and so this could be upscaled with respect to the original input image. Use pixDestroy to delete the image after use.

The following methods are used to generate the images:

RIL_BLOCK: Mask the page image with the block polygon.

RIL_TEXTLINE: Clip the rectangle of the line box from the page image. TODO(rays) fix this to generate and use a line polygon.

RIL_WORD: Clip the rectangle of the word box from the page image.

RIL_SYMBOL: Render the symbol outline to an image for cblobs (prior to recognition) or the bounding box otherwise.

A reconstruction of the original image (using xor to check for double representation) should be reasonably accurate, apart from removed noise, at the block level. Below the block level, the reconstruction will be missing images and line separators. At the symbol level, kerned characters will invade the bounding box if rendered after recognition, making an xor reconstruction inaccurate, but an or reconstruction better. Before recognition, symbol-level reconstruction should be good, even with xor, since the images come from the connected components.
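The difference between an xor and an or reconstruction shows up as soon as two pieces overlap. The toy sketch below packs one row of eight pixels into a byte (1 = foreground); ReconstructXor and ReconstructOr are hypothetical names, and real images would of course be combined with an image library rather than with bytes.

```cpp
#include <cstdint>
#include <vector>

// Combines pieces with xor: a pixel covered by two pieces cancels out, which
// exposes double representation but loses overlapping foreground.
std::uint8_t ReconstructXor(const std::vector<std::uint8_t>& pieces) {
  std::uint8_t row = 0;
  for (std::uint8_t p : pieces) row = static_cast<std::uint8_t>(row ^ p);
  return row;
}

// Combines pieces with or: a pixel covered by any piece stays foreground.
std::uint8_t ReconstructOr(const std::vector<std::uint8_t>& pieces) {
  std::uint8_t row = 0;
  for (std::uint8_t p : pieces) row = static_cast<std::uint8_t>(row | p);
  return row;
}
```

For two pieces 0x06 and 0x03 that share one pixel, xor drops the shared pixel while or keeps it.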

◆ GetImage()

Pix * tesseract::PageIterator::GetImage	(	PageIteratorLevel	level,
		int	padding,
		Pix *	original_img,
		int *	left,
		int *	top
	)		const

Returns an image of the current object at the given level in greyscale if available in the input. To guarantee a binary image use GetBinaryImage. NOTE that in order to give the best possible image, the bounds are expanded slightly over the binary connected component, by the supplied padding, so the top-left position of the returned image is returned in (left, top). These will most likely not match the coordinates returned by BoundingBox. If you do not supply an original image, you will get a binary one. Use pixDestroy to delete the image after use.

◆ IsAtBeginningOf()

bool tesseract::PageIterator::IsAtBeginningOf ( PageIteratorLevel level ) const

virtual

Returns true if the iterator is at the start of an object at the given level.

For instance, suppose an iterator it points to the first symbol of the first word of the third line of the second paragraph of the first block in a page. Then:

    it.IsAtBeginningOf(RIL_BLOCK) = false
    it.IsAtBeginningOf(RIL_PARA) = false
    it.IsAtBeginningOf(RIL_TEXTLINE) = true
    it.IsAtBeginningOf(RIL_WORD) = true
    it.IsAtBeginningOf(RIL_SYMBOL) = true

Possible uses include determining whether a call to Next(RIL_WORD) moved to the start of a RIL_PARA.
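One way to reason about these results is a toy model in which a position stores, for each level, the zero-based index of the current object within its parent. The iterator is then at the beginning of an object at some level exactly when every finer level has index zero. Level, Position and the free function below are hypothetical and say nothing about how the class stores its state.

```cpp
// Hypothetical levels, ordered from coarsest to finest.
enum Level { kBlock = 0, kPara, kTextline, kWord, kSymbol, kNumLevels };

// Hypothetical position: index[l] is the zero-based index of the current
// level-l object within its parent (paragraph within block, and so on).
struct Position {
  int index[kNumLevels];
};

// At the start of an object at `level` if and only if all finer levels are at
// their first element.
bool IsAtBeginningOf(const Position& pos, Level level) {
  for (int l = level + 1; l < kNumLevels; ++l) {
    if (pos.index[l] != 0) return false;
  }
  return true;
}
```

The position {0, 1, 2, 0, 0} models the case of the first symbol of the first word of the third line of the second paragraph of the first block.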

Reimplemented in tesseract::ResultIterator.

◆ IsAtFinalElement()

bool tesseract::PageIterator::IsAtFinalElement	(	PageIteratorLevel	level,
		PageIteratorLevel	element
	)		const

virtual

Returns whether the iterator is positioned at the last element in a given level. (e.g. the last word in a line, the last line in a block)

Consider this two-paragraph example:

    Here's some two-paragraph example text. It starts off innocuously enough but quickly turns bizarre.

    The author inserts a cornucopia of words to guard against confused references.

Now take an iterator it pointed to the start of "bizarre."

    it.IsAtFinalElement(RIL_PARA, RIL_SYMBOL) = false
    it.IsAtFinalElement(RIL_PARA, RIL_WORD) = true
    it.IsAtFinalElement(RIL_BLOCK, RIL_WORD) = false

Reimplemented in tesseract::ResultIterator.

◆ IsWithinFirstTextlineOfParagraph()

bool tesseract::PageIterator::IsWithinFirstTextlineOfParagraph ( ) const

Return whether this iterator points anywhere in the first textline of a paragraph.

◆ Next()

bool tesseract::PageIterator::Next ( PageIteratorLevel level )

virtual

Moves to the start of the next object at the given level in the page hierarchy, and returns false if the end of the page was reached. NOTE that RIL_SYMBOL will skip non-text blocks, but all other PageIteratorLevel level values will visit each non-text block once. Think of non-text blocks as containing a single para, with a single line, with a single imaginary word. Calls to Next with different levels may be freely intermixed. This function iterates words in right-to-left scripts correctly, if the appropriate language has been loaded into Tesseract.

Moves to the start of the next object at the given level in the page hierarchy, and returns false if the end of the page was reached. NOTE (CHANGED!) that ALL PageIteratorLevel level values will visit each non-text block at least once. Think of non-text blocks as containing a single para, with at least one line, with a single imaginary word, containing a single symbol. The bounding boxes mark out any polygonal nature of the block, and PTIsTextType(BlockType()) is false for non-text blocks. Calls to Next with different levels may be freely intermixed. This function iterates words in right-to-left scripts correctly, if the appropriate language has been loaded into Tesseract.
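Because the iterator already points at the first object after Begin, and Next returns false once the end is reached, a do/while loop visits each object exactly once. The sketch below demonstrates the loop shape against a hypothetical MockIterator; its Next takes no level argument and its Empty and Current members are simplifications for illustration, not the real signatures.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in with Begin/Next semantics like those described above.
class MockIterator {
 public:
  explicit MockIterator(std::vector<int> items)
      : items_(std::move(items)), pos_(0) {}

  void Begin() { pos_ = 0; }
  bool Empty() const { return items_.empty(); }

  // Advances to the next item; returns false if there is none.
  bool Next() {
    if (pos_ + 1 >= items_.size()) return false;
    ++pos_;
    return true;
  }

  int Current() const { return items_[pos_]; }

 private:
  std::vector<int> items_;
  std::size_t pos_;
};

// Visits every item once. The body runs before the first call to Next because
// the iterator is already positioned on the first item.
template <typename Iterator>
int SumAll(Iterator& it) {
  int sum = 0;
  it.Begin();
  if (it.Empty()) return sum;
  do {
    sum += it.Current();
  } while (it.Next());
  return sum;
}
```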

Reimplemented in tesseract::ResultIterator.

◆ operator=()

const PageIterator & tesseract::PageIterator::operator= ( const PageIterator & src )

◆ Orientation()

void tesseract::PageIterator::Orientation	(	tesseract::Orientation *	orientation,
		tesseract::WritingDirection *	writing_direction,
		tesseract::TextlineOrder *	textline_order,
		float *	deskew_angle
	)		const

Returns orientation for the block the iterator points to.

orientation, writing_direction, textline_order: see publictypes.h

deskew_angle: after rotating the block so the text orientation is upright, how many radians does one have to rotate the block anti-clockwise for it to be level? -Pi/4 <= deskew_angle <= Pi/4
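The deskew angle is reported in radians, so callers that work in degrees need a conversion. A minimal sketch with hypothetical helper names; the small tolerance in the range check is an assumption of this sketch, added because a float holding Pi/4 can compare slightly above the double value.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;

// Converts a deskew angle from radians to degrees.
double DeskewDegrees(float deskew_angle) {
  return static_cast<double>(deskew_angle) * 180.0 / kPi;
}

// Checks the documented range -Pi/4 <= deskew_angle <= Pi/4, with a small
// tolerance for float rounding (an assumption of this sketch).
bool DeskewInDocumentedRange(float deskew_angle) {
  const double kTolerance = 1e-6;
  return std::fabs(static_cast<double>(deskew_angle)) <= kPi / 4 + kTolerance;
}
```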

◆ ParagraphInfo()

void tesseract::PageIterator::ParagraphInfo	(	tesseract::ParagraphJustification *	justification,
		bool *	is_list_item,
		bool *	is_crown,
		int *	first_line_indent
	)		const

Returns information about the current paragraph, if available.

justification - LEFT if ragged right, or fully justified and script is left-to-right. RIGHT if ragged left, or fully justified and script is right-to-left. unknown if it looks like source code or we have very few lines.

is_list_item - true if we believe this is a member of an ordered or unordered list.

is_crown - true if the first line of the paragraph is aligned with the other lines of the paragraph even though subsequent paragraphs have first line indents. This typically indicates that this is the continuation of a previous paragraph or that it is the very first paragraph in the chapter.

first_line_indent - For LEFT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the left edge of the rest of the paragraph. For RIGHT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the right edge of the rest of the paragraph.

NOTE 1: This value may be negative.

NOTE 2: if *is_crown == true, the first line of this paragraph is actually flush, and first_line_indent is set to the "common" first_line_indent for subsequent paragraphs in this block of text.
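The interaction between justification, is_crown and first_line_indent can be sketched as a small helper that computes where the first line starts. Everything here is hypothetical: the Justification enum is a stand-in for the real type, body_left and body_right are assumed to be the edges of the rest of the paragraph, and treating non-LEFT, non-RIGHT paragraphs as starting at the left edge is an assumption of this sketch.

```cpp
// Hypothetical stand-in for the paragraph justification values.
enum Justification { kUnknown, kLeft, kCenter, kRight };

// Returns the x coordinate at which the first line of the paragraph starts,
// measured from the edge that the justification refers to.
int FirstLineStartX(Justification justification, bool is_crown,
                    int first_line_indent, int body_left, int body_right) {
  // A crown paragraph's first line is flush; first_line_indent then describes
  // the subsequent paragraphs, not this one.
  const int indent = is_crown ? 0 : first_line_indent;
  switch (justification) {
    case kLeft:
      return body_left + indent;   // indent may be negative (hanging indent)
    case kRight:
      return body_right - indent;  // measured from the right edge
    default:
      return body_left;            // assumption: no indent information used
  }
}
```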

◆ PositionedAtSameWord()

bool tesseract::PageIterator::PositionedAtSameWord ( const PAGE_RES_IT * other ) const

Are we positioned at the same location as other?

◆ RestartParagraph()

void tesseract::PageIterator::RestartParagraph ( )

virtual

Moves the iterator to the beginning of the paragraph. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word on the first row of the paragraph.

◆ RestartRow()

void tesseract::PageIterator::RestartRow ( )

virtual

Moves the iterator to the beginning of the text line. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word of the row.

◆ SetBoundingBoxComponents()

void tesseract::PageIterator::SetBoundingBoxComponents	(	bool	include_upper_dots,
		bool	include_lower_dots
	)

inline

Controls what to include in a bounding box. Bounding boxes of all levels between RIL_WORD and RIL_BLOCK can include or exclude potential diacritics. Between layout analysis and recognition, it isn't known where all diacritics belong, so this control is used to include or exclude some diacritics that are above or below the main body of the word. In most cases where the placement is obvious, and after recognition, it doesn't make as much difference, as the diacritics will already be included in the word.

◆ SetWordBlamerBundle()

bool tesseract::PageIterator::SetWordBlamerBundle ( BlamerBundle * blamer_bundle )

Member Data Documentation

◆ blob_index_

int tesseract::PageIterator::blob_index_

protected

The current blob index within the word.

◆ cblob_it_

C_BLOB_IT* tesseract::PageIterator::cblob_it_

protected

Iterator to the blobs within the word. If nullptr, then we are iterating OCR results in the box_word. Owned by this iterator.

◆ include_lower_dots_

bool tesseract::PageIterator::include_lower_dots_

protected

◆ include_upper_dots_

bool tesseract::PageIterator::include_upper_dots_

protected

Control over what to include in bounding boxes.

◆ it_

PAGE_RES_IT* tesseract::PageIterator::it_

protected

The iterator to the page_res_. Owned by this iterator. A pointer just to avoid dragging in Tesseract includes.

◆ page_res_

PAGE_RES* tesseract::PageIterator::page_res_

protected

Pointer to the page_res owned by the API.

◆ rect_height_

int tesseract::PageIterator::rect_height_

protected

◆ rect_left_

int tesseract::PageIterator::rect_left_

protected

◆ rect_top_

int tesseract::PageIterator::rect_top_

protected

◆ rect_width_

int tesseract::PageIterator::rect_width_

protected

◆ scale_

int tesseract::PageIterator::scale_

protected

Parameters saved from the Thresholder. Needed to rebuild coordinates.

◆ scaled_yres_

int tesseract::PageIterator::scaled_yres_

protected

◆ tesseract_

Tesseract* tesseract::PageIterator::tesseract_

protected

Pointer to the Tesseract object owned by the API.

◆ word_

WERD* tesseract::PageIterator::word_

protected

The current input WERD being iterated. If there is an output from OCR, then word_ is nullptr. Owned by the API.

◆ word_length_

int tesseract::PageIterator::word_length_

protected

The length of the current word_.

The documentation for this class was generated from the following files:

/home/stephane/src/tesseract/src/ccmain/pageiterator.h
/home/stephane/src/tesseract/src/ccmain/pageiterator.cpp
