Summarizer
Base class for all summarizers. A summarizer's chief method is getSummary, which takes a text or XML document and produces a summary of that document of up to PageProcessor::$max_description_len characters. Summarizers also contain various methods to generate a word cloud from such a summary.
Table of Contents
- CENTROID_COMPONENTS = 1000
- Number of nonzero centroid components
- MAX_DISTINCT_TERMS = 1000
- Number of distinct terms to use in generating summary
- WORD_CLOUD_LEN = 5
- Number of words in word cloud
- computeTermFrequenciesPerSentence() : array<string|int, mixed>
- Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
- formatDoc() : string
- Formats the document by removing carriage returns, hyphens, and digits, since digits are not used in the word cloud.
- formatSentence() : string
- Formats the sentences to remove all characters except words, digits and spaces
- getPunctuatedUnpunctuatedSentences() : array<string|int, mixed>
- Breaks any content into sentences with and without punctuation
- getSentences() : array<string|int, mixed>
- Breaks any content into sentences by splitting it on spaces or carriage returns
- getSummary() : array<string|int, mixed>
- Compute a summary, word cloud, and scores for text ranges within the summary of a document in a given language
- getSummaryFromSentenceScores() : array<string|int, mixed>
- Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters, built from the highest-scored sentences concatenated in the order they appeared in the original document.
- getTermFrequencies() : array<string|int, mixed>
- Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
- getTermsFromSentences() : array<string|int, mixed>
- Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
- numSentencesForSummary() : int
- Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.
- pageProcessing() : string
- Performs additional processing on the page, such as removing all the tags from the page
- removePunctuation() : array<string|int, mixed>
- Remove punctuation from an array of sentences
- removeStopWords() : array<string|int, mixed>
- Returns a new array of sentences without the stop words
- wordCloudFromSummary() : array<string|int, mixed>
- Generates an array of most important words from a string $summary.
- wordCloudFromTermVector() : array<string|int, mixed>
- Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms
Constants
CENTROID_COMPONENTS
Number of nonzero centroid components
public
mixed
CENTROID_COMPONENTS
= 1000
MAX_DISTINCT_TERMS
Number of distinct terms to use in generating summary
public
mixed
MAX_DISTINCT_TERMS
= 1000
WORD_CLOUD_LEN
Number of words in word cloud
public
mixed
WORD_CLOUD_LEN
= 5
Methods
computeTermFrequenciesPerSentence()
Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
public
static computeTermFrequenciesPerSentence(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the array of sentences to process
- $lang : string
-
the current locale
Return values
array<string|int, mixed> —an array with [array of terms, array of normalized term frequencies] pairs
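The return shape described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the function name sketchTermFrequenciesPerSentence is hypothetical, the $lang parameter (and any stemming it implies) is ignored, terms are found by splitting on whitespace, and "normalized" is assumed to mean a term's count divided by the number of terms in its sentence.

```php
<?php
/**
 * Hypothetical helper, not the library's code.
 * Assumption: a normalized frequency is count / terms-in-sentence.
 */
function sketchTermFrequenciesPerSentence(array $sentences): array
{
    $terms = [];
    $frequencies = [];
    foreach ($sentences as $index => $sentence) {
        // Split the sentence into words on runs of whitespace
        $words = preg_split('/\s+/', trim($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($words);
        $total = count($words);
        $frequencies[$index] = [];
        foreach ($counts as $word => $count) {
            $terms[$word] = true; // remember each distinct term once
            $frequencies[$index][$word] = $count / $total;
        }
    }
    return [array_keys($terms), $frequencies];
}

[$terms, $tf] = sketchTermFrequenciesPerSentence(
    ["the cat saw the dog", "dogs bark"]);
```

Here $tf[0]['the'] is 2/5 because "the" occurs twice among the five words of the first sentence.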
formatDoc()
Formats the document by removing carriage returns, hyphens, and digits, since digits are not used in the word cloud.
public
static formatDoc(string $content) : string
The formatted document generated by this function is only used to compute the centroid.
Parameters
- $content : string
-
formatted page.
Return values
string —formatted document.
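As a rough illustration of this kind of clean-up, the sketch below uses preg_replace. The function name sketchFormatDoc is hypothetical, and the choices to replace line breaks and hyphens with a space (rather than delete them) and to collapse repeated spaces are assumptions, not documented behavior.

```php
<?php
/**
 * Hypothetical helper, not the library's code.
 */
function sketchFormatDoc(string $content): string
{
    // Assumption: line breaks become a single space
    $content = preg_replace('/[\r\n]+/', ' ', $content);
    // Assumption: hyphens become a space so joined words stay separate
    $content = preg_replace('/-+/', ' ', $content);
    // Digits are dropped entirely
    $content = preg_replace('/\d+/', '', $content);
    // Collapse any repeated spaces left behind and trim the ends
    return trim(preg_replace('/ {2,}/', ' ', $content));
}

$formatted = sketchFormatDoc("State-of-the-art\r\nresults in 2024");
```

With these assumptions, $formatted is "State of the art results in".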
formatSentence()
Formats the sentences to remove all characters except words, digits and spaces
public
static formatSentence(string $sentence) : string
Parameters
- $sentence : string
-
complete page.
Return values
string —formatted sentences.
getPunctuatedUnpunctuatedSentences()
Breaks any content into sentences with and without punctuation
public
static getPunctuatedUnpunctuatedSentences(object $dom, string $content, string $lang) : array<string|int, mixed>
Parameters
- $dom : object
-
a document object to extract a description from.
- $content : string
-
complete page.
- $lang : string
-
locale tag of the language for data being processed
Return values
array<string|int, mixed> —array [sentences_with_punctuation, sentences_with_punctuation_stripped]
getSentences()
Breaks any content into sentences by splitting it on spaces or carriage returns
public
static getSentences(string $content) : array<string|int, mixed>
Parameters
- $content : string
-
complete page.
Return values
array<string|int, mixed> —array of sentences from that content.
getSummary()
Compute a summary, word cloud, and scores for text ranges within the summary of a document in a given language
public
static getSummary(object $dom, string $page, string $lang) : array<string|int, mixed>
Parameters
- $dom : object
-
document object model used to locate items for summary
- $page : string
-
raw document sentences should be extracted from
- $lang : string
-
locale tag for language the summary is in
Return values
array<string|int, mixed> —[$summary, $word_cloud, $summary_scores]
getSummaryFromSentenceScores()
Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters, built from the highest-scored sentences concatenated in the order they appeared in the original document.
public
static getSummaryFromSentenceScores(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentence_scores : array<string|int, mixed>
-
an array sorted by score of sentence_index => score pairs.
- $sentences : array<string|int, mixed>
-
the array of sentences corresponding to sentence $sentence_scores indices
- $lang : string
-
language of the page, used to decide which stop words to use by calling the tokenizer.php of the specified language.
Return values
array<string|int, mixed> —[a string that represents the summary, a vector of (pos, score) pairs]
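The selection strategy described for this method can be sketched as follows. This is a simplified illustration, not the library's code: the function name sketchSummaryFromScores is hypothetical, $max_len stands in for PageProcessor::$max_description_len, $lang is ignored, and "pos" is assumed to be the character offset of each sentence within the summary.

```php
<?php
/**
 * Hypothetical helper, not the library's code.
 * $sentence_scores must already be sorted from highest to lowest score.
 */
function sketchSummaryFromScores(array $sentence_scores, array $sentences,
    int $max_len): array
{
    $chosen = [];
    $length = 0;
    foreach ($sentence_scores as $index => $score) {
        $needed = strlen($sentences[$index]) + 1; // +1 for a joining space
        if ($length + $needed > $max_len) {
            break; // the next best sentence no longer fits
        }
        $chosen[$index] = $score;
        $length += $needed;
    }
    ksort($chosen); // restore original document order
    $summary = "";
    $summary_scores = [];
    foreach ($chosen as $index => $score) {
        // Assumption: pos is the offset where the sentence starts
        $summary_scores[] = [strlen($summary), $score];
        $summary .= $sentences[$index] . " ";
    }
    return [rtrim($summary), $summary_scores];
}

$sentences = ["Cats purr.", "Dogs bark loudly.", "Fish swim."];
$scores = [2 => 0.9, 0 => 0.7, 1 => 0.2];
[$summary, $summary_scores] =
    sketchSummaryFromScores($scores, $sentences, 25);
```

The two best sentences fit within 25 characters, so $summary is "Cats purr. Fish swim.", with the sentences back in their original order.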
getTermFrequencies()
Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
public
static getTermFrequencies(array<string|int, mixed> $terms, mixed $sentence_or_sentences) : array<string|int, mixed>
Parameters
- $terms : array<string|int, mixed>
-
the list of all terms in the doc
- $sentence_or_sentences : mixed
-
either a single string sentence or an array of sentences
Return values
array<string|int, mixed> —sequence of term => frequency pairs
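One way to picture the term => frequency result is the sketch below, which accepts either a single string or an array of sentences. The function name sketchTermFrequencies is hypothetical, and counting whole-word matches with a regular expression is an assumption about how occurrences are tallied.

```php
<?php
/**
 * Hypothetical helper, not the library's code.
 */
function sketchTermFrequencies(array $terms, $sentence_or_sentences): array
{
    // Accept a single sentence or an array of sentences
    $text = is_array($sentence_or_sentences)
        ? implode(" ", $sentence_or_sentences)
        : $sentence_or_sentences;
    $frequencies = [];
    foreach ($terms as $term) {
        // Assumption: count whole-word occurrences of each term
        $pattern = '/\b' . preg_quote($term, '/') . '\b/u';
        $frequencies[$term] = preg_match_all($pattern, $text);
    }
    return $frequencies;
}

$freq = sketchTermFrequencies(['cat', 'dog'],
    ["the cat saw a cat", "a dog"]);
```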
getTermsFromSentences()
Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
public
static getTermsFromSentences(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the sentences in the doc
- $lang : string
-
locale tag for stemming
Return values
array<string|int, mixed> —an array of terms in the array of sentences
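A minimal sketch of picking the most frequent terms is given below. The names sketchTermsFromSentences and SKETCH_MAX_DISTINCT_TERMS are hypothetical (the constant is set to 2 only to keep the example small), and the locale-dependent stemming implied by $lang is omitted.

```php
<?php
// Hypothetical stand-in for self::MAX_DISTINCT_TERMS, kept small here
const SKETCH_MAX_DISTINCT_TERMS = 2;

/**
 * Hypothetical helper, not the library's code.
 */
function sketchTermsFromSentences(array $sentences): array
{
    $words = [];
    foreach ($sentences as $sentence) {
        $parts = preg_split('/\s+/', trim($sentence), -1,
            PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $part) {
            $words[] = $part;
        }
    }
    $counts = array_count_values($words);
    arsort($counts); // most frequent terms first
    return array_slice(array_keys($counts), 0, SKETCH_MAX_DISTINCT_TERMS);
}

$top_terms = sketchTermsFromSentences(
    ["search engine index", "search engine", "search"]);
```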
numSentencesForSummary()
Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.
public
static numSentencesForSummary(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences) : int
Parameters
- $sentence_scores : array<string|int, mixed>
-
associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).
- $sentences : array<string|int, mixed>
-
sentences in doc in their original order
Return values
int —number of sentences
pageProcessing()
Performs additional processing on the page, such as removing all the tags from the page
public
static pageProcessing(string $page) : string
Parameters
- $page : string
-
complete page.
Return values
string —processed page.
removePunctuation()
Remove punctuation from an array of sentences
public
static removePunctuation(array<string|int, mixed> $sentences) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the sentences in the doc
Return values
array<string|int, mixed> —the array of sentences with the punctuation removed
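For illustration, punctuation can be stripped from each sentence with a Unicode-aware regular expression. The function name sketchRemovePunctuation is hypothetical, and treating every character in the Unicode punctuation class \p{P} as removable is an assumption about what counts as punctuation.

```php
<?php
/**
 * Hypothetical helper, not the library's code.
 */
function sketchRemovePunctuation(array $sentences): array
{
    return array_map(function ($sentence) {
        // Replace each run of punctuation with a space
        $sentence = preg_replace('/\p{P}+/u', ' ', $sentence);
        // Collapse the whitespace left behind and trim the ends
        return trim(preg_replace('/\s+/u', ' ', $sentence));
    }, $sentences);
}

$clean = sketchRemovePunctuation(["Hello, world!", "Wait... what?"]);
```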
removeStopWords()
Returns a new array of sentences without the stop words
public
static removeStopWords(array<string|int, mixed> $sentences, object $stop_obj) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the array of sentences to process
- $stop_obj : object
-
the class that has the stopwordsRemover method
Return values
array<string|int, mixed> —a new array of sentences without the stop words
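The delegation to a stop-word object can be sketched as follows. Everything here is hypothetical: the class SketchStopWords, its three-word stop list, and the assumption that the object exposes a method named stopwordsRemover which takes a string and returns it with stop words removed.

```php
<?php
/**
 * Hypothetical stop-word class with a tiny hard-coded list.
 */
class SketchStopWords
{
    public function stopwordsRemover(string $text): string
    {
        $stop_words = ['the', 'a', 'of'];
        $pattern = '/\b(' . implode('|', $stop_words) . ')\b/i';
        $text = preg_replace($pattern, '', $text);
        // Tidy up the gaps left by removed words
        return trim(preg_replace('/\s+/', ' ', $text));
    }
}

/**
 * Hypothetical helper, not the library's code.
 */
function sketchRemoveStopWords(array $sentences, object $stop_obj): array
{
    $result = [];
    foreach ($sentences as $key => $sentence) {
        // The input array is left untouched; a new array is built
        $result[$key] = $stop_obj->stopwordsRemover($sentence);
    }
    return $result;
}

$without_stop_words = sketchRemoveStopWords(
    ["The history of a city"], new SketchStopWords());
```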
wordCloudFromSummary()
Generates an array of most important words from a string $summary.
public
static wordCloudFromSummary(string $summary, string $lang[, array<string|int, mixed> $term_frequencies = null ]) : array<string|int, mixed>
Currently, the algorithm is based on term frequencies after stop words are removed.
Parameters
- $summary : string
-
text to derive most important words of
- $lang : string
-
locale tag for language of $summary
- $term_frequencies : array<string|int, mixed> = null
-
a supplied list of terms and frequencies for words in summary. If null then these will be computed.
Return values
array<string|int, mixed> —the top self::WORD_CLOUD_LEN most important terms in $summary
wordCloudFromTermVector()
Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms
public
static wordCloudFromTermVector(array<string|int, mixed> $term_vector[, mixed $terms = false ]) : array<string|int, mixed>
Parameters
- $term_vector : array<string|int, mixed>
-
if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.
- $terms : mixed = false
-
if not false, then should be an array of terms, at a minimum having all the indices of $term_vector
Return values
array<string|int, mixed> —the top self::WORD_CLOUD_LEN most important terms in $term_vector
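The two accepted input shapes can be illustrated with a small sketch. The names sketchWordCloudFromTermVector and SKETCH_WORD_CLOUD_LEN are hypothetical, and the sketch assumes the vector arrives already sorted from highest to lowest weight, so it simply takes the leading entries.

```php
<?php
// Hypothetical stand-in for self::WORD_CLOUD_LEN
const SKETCH_WORD_CLOUD_LEN = 5;

/**
 * Hypothetical helper, not the library's code.
 */
function sketchWordCloudFromTermVector(array $term_vector,
    $terms = false): array
{
    $cloud = [];
    foreach ($term_vector as $key => $weight) {
        if (count($cloud) >= SKETCH_WORD_CLOUD_LEN) {
            break;
        }
        // Keys are the terms themselves, or indices into $terms
        $cloud[] = ($terms === false) ? $key : $terms[$key];
    }
    return $cloud;
}

// Shape 1: term => weight pairs
$by_term = sketchWordCloudFromTermVector(
    ['index' => 0.9, 'crawl' => 0.5]);
// Shape 2: term_index => weight pairs plus a list of terms
$by_index = sketchWordCloudFromTermVector(
    [2 => 0.9, 0 => 0.5], ['crawl', 'query', 'index']);
```

Both calls yield the same cloud, ['index', 'crawl'], because the second vector refers to the same terms by position.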