ImageProcessor
extends PageProcessor
Base abstract class common to all processors used to create crawl summary information from images
Table of Contents
- $image_types : array<string|int, mixed>
- Array of file types which should be considered images.
- $indexed_file_types : array<string|int, mixed>
- Array of file extensions which can be handled by the search engine; other extensions will be ignored.
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $mime_processor : array<string|int, mixed>
- Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
- $plugin_instances : array<string|int, mixed>
- indexing_plugins which might be used with the current processor
- $summarizer : object
- Stores the summarizer object used by this instance of page processor to be used in generating a summary
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $text_data : bool
- Whether the current processor is for text data (e.g., text, html, xml) or for some other format (e.g., gif, png)
- __construct() : mixed
- Sets up any indexing plugins associated with this page processor
- addWidthHeightSummary() : array<string|int, mixed>
- Given an $image_string, determines its width and height if possible, then assigns the values to the CrawlConstants::WIDTH and CrawlConstants::HEIGHT fields of $summary
- averageColor() : array<string|int, mixed>
- Computes the average RGBA pixel value over an image by resampling the image down to a 1x1 pixel image, then extracting its rgba value as a vector
- createThumb() : string
- Used to create a thumbnail from an image object
- getXmpData() : array<string|int, mixed>
- Given an image, tries to extract any XMP info from it.
- handle() : array<string|int, mixed>
- Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
- initializeIndexedFileTypes() : mixed
- Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
- isBlackAndWhite() : bool
- Checks if an image is black and white (really gray scale) by sampling 200 points and checking that the rgb values are the same at each point.
- process() : array<string|int, mixed>
- Extracts summary data from the image provided in $page together with the URL in $url where it was downloaded from
- saveTempFile() : mixed
- Used to save a temporary file with the data downloaded for a url while carrying out image processing
Properties
$image_types
Array of file types which should be considered images.
public static array<string|int, mixed> $image_types = []
Sub-classes add to this array with the types they handle
$indexed_file_types
Array of file extensions which can be handled by the search engine; other extensions will be ignored.
public static array<string|int, mixed> $indexed_file_types = ["unknown"]
Sub-classes add to this array with the types they handle
$max_description_len
Max number of chars to extract for description from a page to index.
public static int $max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public static int $max_links_to_extract
$mime_processor
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
public static array<string|int, mixed> $mime_processor = []
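As a rough illustration of how sub-classes might add to these static arrays when they are constructed, here is a minimal sketch. RegistrySketch and PngSketch are hypothetical stand-ins, not classes from this package, and the "png" and "image/png" entries are assumed example values.

```php
<?php
// Hypothetical stand-in for the base processor's static registries.
class RegistrySketch
{
    public static array $indexed_file_types = ["unknown"];
    public static array $image_types = [];
    public static array $mime_processor = [];
}

// Hypothetical sub-class that registers the types it handles.
class PngSketch extends RegistrySketch
{
    public function __construct()
    {
        // Inherited static properties share storage with the parent class,
        // so every sub-class writes into the same three arrays.
        self::$indexed_file_types[] = "png";
        self::$image_types[] = "png";
        self::$mime_processor["image/png"] = "PngSketch";
    }
}

// Constructing the sub-class populates the shared registries.
new PngSketch();
```

Because the arrays are static and not redeclared in the sub-class, a lookup such as RegistrySketch::$mime_processor["image/png"] sees the entry added by PngSketch.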
$plugin_instances
indexing_plugins which might be used with the current processor
public array<string|int, mixed> $plugin_instances
$summarizer
Stores the summarizer object used by this instance of page processor to be used in generating a summary
public object $summarizer
$summarizer_option
Stores the name of the summarizer used for crawling.
public string $summarizer_option
Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$text_data
Whether the current processor is for text data (e.g., text, html, xml) or for some other format (e.g., gif, png)
public bool $text_data
Methods
__construct()
Sets up any indexing plugins associated with this page processor
public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, mixed $max_links_to_extract = null ][, int $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
- $plugins : array<string|int, mixed> = []
  an array of indexing plugins which might do further processing on the data handled by this page processor
- $max_description_len : int = null
  maximal length of a page summary
- $max_links_to_extract : mixed = null
- $summarizer_option : int = self::BASIC_SUMMARIZER
  CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER
Return values
mixed
addWidthHeightSummary()
Given an $image_string, determines its width and height if possible, then assigns the values to the CrawlConstants::WIDTH and CrawlConstants::HEIGHT fields of $summary
public addWidthHeightSummary(array<string|int, mixed> &$summary, string $image_string) : array<string|int, mixed>
Parameters
- $summary : array<string|int, mixed>
  to write the width and height into
- $image_string : string
  the image represented as a character string
Return values
array<string|int, mixed> —summary information including a thumbnail and a description (where the description is just the url)
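One way the width and height lookup described above could work is sketched here, assuming PHP's built-in getimagesizefromstring(), which does not need the GD extension. The function name and the WIDTH_KEY and HEIGHT_KEY constants are hypothetical stand-ins for the CrawlConstants fields; this is not the class's actual implementation.

```php
<?php
// Hypothetical stand-ins for the CrawlConstants::WIDTH and
// CrawlConstants::HEIGHT keys named in the description above.
const WIDTH_KEY = 'WIDTH';
const HEIGHT_KEY = 'HEIGHT';

/**
 * Writes the pixel dimensions of $image_string into $summary when they
 * can be determined; otherwise leaves $summary unchanged.
 */
function addWidthHeightSketch(array &$summary, string $image_string): array
{
    // @ silences the notice PHP may emit for truncated image data
    $info = @getimagesizefromstring($image_string);
    if ($info !== false) {
        $summary[WIDTH_KEY] = $info[0];   // index 0 holds the width
        $summary[HEIGHT_KEY] = $info[1];  // index 1 holds the height
    }
    return $summary;
}
```

Data that is not a recognizable image simply leaves the summary without width and height entries.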
averageColor()
Computes the average RGBA pixel value over an image by resampling the image down to a 1x1 pixel image, then extracting its rgba value as a vector
public static averageColor(GdImage $image) : array<string|int, mixed>
Parameters
- $image : GdImage
  object to calculate average color for
Return values
array<string|int, mixed> —a 4-tuple with components [red, green, blue, alpha]
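The resample-to-one-pixel idea can be sketched with standard GD calls as follows. This assumes PHP 8.0 or later with the GD extension loaded; averageColorSketch is a hypothetical name, and the real method may differ in details such as alpha handling.

```php
<?php
/**
 * Averages an image's pixels by resampling it down to a 1x1 image.
 * Returns [red, green, blue, alpha]; alpha uses GD's scale of
 * 0 (opaque) to 127 (fully transparent).
 */
function averageColorSketch(GdImage $image): array
{
    $pixel = imagecreatetruecolor(1, 1);
    // Keep the source alpha channel rather than blending onto black.
    imagealphablending($pixel, false);
    imagesavealpha($pixel, true);
    // Resampling the whole source into one pixel averages its colors.
    imagecopyresampled($pixel, $image, 0, 0, 0, 0, 1, 1,
        imagesx($image), imagesy($image));
    $rgba = imagecolorsforindex($pixel, imagecolorat($pixel, 0, 0));
    return [$rgba['red'], $rgba['green'], $rgba['blue'], $rgba['alpha']];
}
```

Letting GD do the averaging in a single imagecopyresampled() call avoids looping over every pixel in PHP.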
createThumb()
Used to create a thumbnail from an image object
public static createThumb(object $image[, int $width = CTHUMB_DIM ][, int $height = CTHUMB_DIM ]) : string
Parameters
- $image : object
  image object to create the thumbnail from
- $width : int = CTHUMB_DIM
  width in pixels of the thumb; if width is negative and height is positive, this dimension is set proportionally based on the input image's width versus height
- $height : int = CTHUMB_DIM
  height in pixels of the thumb; if height is negative and width is positive, this dimension is set proportionally based on the input image's width versus height
Return values
string —string of the JPEG thumbnail image if it could be created, empty string otherwise
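The proportional sizing rule for negative dimensions can be sketched as below. Both function names are hypothetical, the rounding and minimum-size choices are assumptions, and the GD part assumes PHP 8.0 or later with JPEG support; the actual createThumb() may behave differently.

```php
<?php
/**
 * Resolves thumbnail dimensions: a negative width (with a positive height),
 * or a negative height (with a positive width), is replaced by the value
 * that preserves the source image's aspect ratio.
 */
function thumbDimensionsSketch(int $src_width, int $src_height,
    int $width, int $height): array
{
    if ($width < 0 && $height > 0) {
        $width = max(1, (int) round($height * $src_width / $src_height));
    } elseif ($height < 0 && $width > 0) {
        $height = max(1, (int) round($width * $src_height / $src_width));
    }
    return [$width, $height];
}

/** Returns the JPEG bytes of a thumbnail of $image, or "" on failure. */
function createThumbSketch(GdImage $image, int $width, int $height): string
{
    [$width, $height] = thumbDimensionsSketch(imagesx($image),
        imagesy($image), $width, $height);
    if ($width <= 0 || $height <= 0) {
        return "";
    }
    $thumb = imagecreatetruecolor($width, $height);
    imagecopyresampled($thumb, $image, 0, 0, 0, 0, $width, $height,
        imagesx($image), imagesy($image));
    // imagejpeg() writes to the output stream, so capture it in a buffer.
    ob_start();
    $ok = imagejpeg($thumb);
    $jpeg = ob_get_clean();
    return ($ok && $jpeg !== false) ? $jpeg : "";
}
```

For a 400x200 source, asking for a width of -1 and a height of 50 resolves to a 100x50 thumbnail under this rule.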
getXmpData()
Given an image, tries to extract any XMP info from it.
public getXmpData(string $image_string) : array<string|int, mixed>
Parameters
- $image_string : string
  the image represented as a character string
Return values
array<string|int, mixed> —XMP data converted from XML format to an array-like format
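XMP metadata is normally embedded in an image as an XML packet wrapped in an x:xmpmeta element, so one plausible approach is to locate that packet in the image string and then flatten it into an array. The sketch below assumes this approach and PHP's xml extension for the conversion step; both function names are hypothetical, and the array layout produced by the actual getXmpData() is not specified here.

```php
<?php
/** Returns the XMP packet embedded in $image_string, or "" if none is found. */
function extractXmpPacketSketch(string $image_string): string
{
    $start = strpos($image_string, '<x:xmpmeta');
    if ($start === false) {
        return "";
    }
    $end_tag = '</x:xmpmeta>';
    $end = strpos($image_string, $end_tag, $start);
    if ($end === false) {
        return "";
    }
    return substr($image_string, $start, $end - $start + strlen($end_tag));
}

/**
 * Flattens an XMP packet into [tag name => text value].
 * Repeated tags overwrite earlier ones in this simplified form.
 */
function xmpToArraySketch(string $xmp): array
{
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    $values = [];
    xml_parse_into_struct($parser, $xmp, $values);
    $out = [];
    foreach ($values as $node) {
        if (isset($node['value'])) {
            $out[$node['tag']] = $node['value'];
        }
    }
    return $out;
}
```

Searching the raw bytes works because the XMP packet is stored as plain text inside the binary image data.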
handle()
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
public handle(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
  string of a web document
- $url : string
  location the document came from
Return values
array<string|int, mixed> —a summary of (title, description, links, and content) of the information in $page; it also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page or tweets.
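The flow described for handle(), summarizing via process() and then collecting sub-documents from plugins, might look roughly like the following. The SubdocPluginSketch interface, its pageProcessing() method, and the 'SUBDOCS' key are assumptions made for illustration; the real plugin contract and summary keys may differ.

```php
<?php
/** Hypothetical plugin contract; the real indexing plugin API may differ. */
interface SubdocPluginSketch
{
    /** Returns an array of sub-document summaries found in $page. */
    public function pageProcessing(string $page, string $url): array;
}

abstract class HandleSketch
{
    /** @var SubdocPluginSketch[] plugins to run after process() */
    public array $plugin_instances = [];

    /** Sub-classes supply the actual summary extraction. */
    abstract public function process(string $page, string $url): array;

    public function handle(string $page, string $url): array
    {
        $summary = $this->process($page, $url);
        // 'SUBDOCS' is an assumed key name for the sub-document list.
        $summary['SUBDOCS'] = [];
        foreach ($this->plugin_instances as $plugin) {
            foreach ($plugin->pageProcessing($page, $url) as $subdoc) {
                $summary['SUBDOCS'][] = $subdoc;
            }
        }
        return $summary;
    }
}
```

Keeping process() abstract lets each concrete processor decide how to summarize its format while handle() stays the same for all of them.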
initializeIndexedFileTypes()
Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
public static initializeIndexedFileTypes() : mixed
Return values
mixed
isBlackAndWhite()
Checks if an image is black and white (really gray scale) by sampling 200 points and checking that the rgb values are the same at each point.
public static isBlackAndWhite(GdImage $image) : bool
Parameters
- $image : GdImage
  image object to check for being black and white
Return values
bool —true if black and white
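The sampling check can be sketched with GD as follows, assuming PHP 8.0 or later with the GD extension. The function name is hypothetical, and choosing the sample points at random is an assumption; the description above only says that 200 points are sampled.

```php
<?php
/**
 * Samples $samples pixels and reports whether every sampled pixel has
 * equal red, green, and blue components (that is, is a shade of gray).
 */
function isGrayScaleSketch(GdImage $image, int $samples = 200): bool
{
    $width = imagesx($image);
    $height = imagesy($image);
    for ($i = 0; $i < $samples; $i++) {
        // Assumed sampling strategy: uniformly random coordinates.
        $index = imagecolorat($image, mt_rand(0, $width - 1),
            mt_rand(0, $height - 1));
        $c = imagecolorsforindex($image, $index);
        if ($c['red'] !== $c['green'] || $c['green'] !== $c['blue']) {
            return false;
        }
    }
    return true;
}
```

Sampling keeps the check cheap on large images, at the cost of occasionally missing a small colored region.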
process()
Extracts summary data from the image provided in $page together with the URL in $url where it was downloaded from
public process(string $page, string $url) : array<string|int, mixed>
ImageProcessor class defers a proper implementation of this method to subclasses
Parameters
- $page : string
  the image represented as a character string
- $url : string
  the url where the image was downloaded from
Return values
array<string|int, mixed> —summary information including a thumbnail and a description (where the description is just the url)
saveTempFile()
Used to save a temporary file with the data downloaded for a url while carrying out image processing
public saveTempFile(string $page, string $url, string $file_extension) : mixed
Parameters
- $page : string
  contains data about an image that one needs to save
- $url : string
  where $page data came from
- $file_extension : string
  to be associated with the $page data
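Saving downloaded image data to a temporary file could be done along these lines. The function name, the use of the system temp directory, the md5-of-URL naming scheme, and the returned path are all assumptions made for this sketch; the documentation above does not say where the actual method writes or what it returns.

```php
<?php
/**
 * Writes $page to a file in the system temp directory and returns the
 * file's path, or "" if the write fails.
 */
function saveTempFileSketch(string $page, string $url,
    string $file_extension): string
{
    // Assumed naming scheme: hash of the URL plus the given extension.
    $path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . md5($url) . "." .
        $file_extension;
    return file_put_contents($path, $page) === false ? "" : $path;
}
```

Deriving the name from the URL means repeated downloads of the same URL overwrite one file; the caller would be expected to delete it after processing.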