GraphBasedSummarizer
extends Summarizer
in package
Class which may be used by TextProcessors to get a summary for a text document that may later be used for indexing. The method @see getSummary is used to obtain such a summary. In GraphBasedSummarizer's implementation of this method, sentences are ranked using a page rank style algorithm based on sentence adjacencies calculated using a distortion score between pairs of sentences (@see LinearAlgebra::distortion for details on this).
The page rank is then biased using a Zipf-like transformation to slightly favor sentences earlier in the document.
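For orientation, here is a minimal, hypothetical usage sketch. It assumes the raw page is already in $page, that a DOMDocument has been built from it (normally done by the calling TextProcessor), and that the class is reachable via the application's autoloader; the locale tag "en-US" is just an example.
```php
<?php
// Hypothetical caller sketch, not part of the class itself. Assumes
// GraphBasedSummarizer is reachable through the application's
// autoloader (any use statement for its package is omitted here).
$page = file_get_contents("example.html"); // raw page content
$lang = "en-US"; // locale tag used to pick stop words and tokenizer

$dom = new \DOMDocument();
@$dom->loadHTML($page); // suppress warnings from real-world markup

// getSummary() returns a triple:
// [string summary, array word cloud, array of position => score]
list($summary, $word_cloud, $position_scores) =
    GraphBasedSummarizer::getSummary($dom, $page, $lang);

echo $summary, "\n";
print_r($word_cloud);
```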
Table of Contents
- CENTROID_COMPONENTS = 1000
- Number of nonzero centroid components
- MAX_DISTINCT_TERMS = 1000
- Number of distinct terms to use in generating summary
- WORD_CLOUD_LEN = 5
- Number of words in word cloud
- computeAdjacency() : array<string|int, mixed>
- Compute the adjacency matrix based on the pairwise distortion measure between sentences
- computeTermFrequenciesPerSentence() : array<string|int, mixed>
- Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
- formatDoc() : string
- Formats the document to remove carriage returns, hyphens and digits, as we will not be using digits in the word cloud.
- formatSentence() : string
- Formats the sentences to remove all characters except words, digits and spaces
- getPunctuatedUnpunctuatedSentences() : array<string|int, mixed>
- Breaks any content into sentences with and without punctuation
- getSentenceRanks() : array<string|int, mixed>
- Compute the sentence ranks using the power method.
- getSentences() : array<string|int, mixed>
- Breaks any content into sentences by splitting it on spaces or carriage returns
- getSummary() : array<string|int, mixed>
- This summarizer uses a page rank-like algorithm to find the important sentences in a document, generate a word cloud, and give scores for those sentences.
- getSummaryFromSentenceScores() : array<string|int, mixed>
- Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters based on the highest scored sentences, concatenated in the order they appeared in the original document.
- getTermFrequencies() : array<string|int, mixed>
- Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
- getTermsFromSentences() : array<string|int, mixed>
- Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
- numSentencesForSummary() : int
- Calculates how many sentences to put in the summary so that it fits within MAX_DESCRIPTION_LEN.
- pageProcessing() : string
- This function does additional processing on the page, such as removing all the tags from the page
- removePunctuation() : array<string|int, mixed>
- Remove punctuation from an array of sentences
- removeStopWords() : array<string|int, mixed>
- Returns a new array of sentences without the stop words
- wordCloudFromSummary() : array<string|int, mixed>
- Generates an array of most important words from a string $summary.
- wordCloudFromTermVector() : array<string|int, mixed>
- Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms
Constants
CENTROID_COMPONENTS
Number of nonzero centroid components
public mixed CENTROID_COMPONENTS = 1000
MAX_DISTINCT_TERMS
Number of distinct terms to use in generating summary
public mixed MAX_DISTINCT_TERMS = 1000
WORD_CLOUD_LEN
Number of words in word cloud
public mixed WORD_CLOUD_LEN = 5
Methods
computeAdjacency()
Compute the adjacency matrix based on the pairwise distortion measure between sentences
public
static computeAdjacency(array<string|int, mixed> $tf_per_sentence_normalized) : array<string|int, mixed>
Parameters
- $tf_per_sentence_normalized : array<string|int, mixed>
-
the array of normalized term frequencies per sentence
Return values
array<string|int, mixed> —the sentence adjacency matrix
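As a rough illustration of the shape of this computation (a sketch, not the actual implementation), the code below builds a column-normalized adjacency matrix from pairwise sentence scores; the $pair_score closure is a stand-in for the distortion score computed by LinearAlgebra::distortion.
```php
<?php
// Sketch of an adjacency computation over normalized term-frequency
// vectors, one vector per sentence. $pair_score is a placeholder for
// the distortion score (LinearAlgebra::distortion) the real method uses.
function sketchComputeAdjacency(array $tf_per_sentence_normalized,
    callable $pair_score): array
{
    $n = count($tf_per_sentence_normalized);
    $adjacency = [];
    for ($i = 0; $i < $n; $i++) {
        for ($j = 0; $j < $n; $j++) {
            $adjacency[$i][$j] = $pair_score(
                $tf_per_sentence_normalized[$i],
                $tf_per_sentence_normalized[$j]);
        }
    }
    // Normalize each column to sum to 1, the kind of normalization the
    // power method in getSentenceRanks() needs to converge.
    for ($j = 0; $j < $n; $j++) {
        $col_sum = 0;
        for ($i = 0; $i < $n; $i++) {
            $col_sum += $adjacency[$i][$j];
        }
        if ($col_sum > 0) {
            for ($i = 0; $i < $n; $i++) {
                $adjacency[$i][$j] /= $col_sum;
            }
        }
    }
    return $adjacency;
}
```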
computeTermFrequenciesPerSentence()
Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
public
static computeTermFrequenciesPerSentence(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the array of sentences to process
- $lang : string
-
the current locale
Return values
array<string|int, mixed> —an array of [array of terms, array of normalized term frequencies] pairs, one pair per sentence
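The sketch below illustrates the [array of terms, array of normalized term frequencies] output shape described above; it uses a naive whitespace tokenizer where the real method relies on locale-aware tokenization selected by $lang.
```php
<?php
// Illustrative sketch only: naive whitespace tokenization stands in
// for the locale-aware tokenizer selected by $lang in the real method.
function sketchTermFrequenciesPerSentence(array $sentences): array
{
    $result = [];
    foreach ($sentences as $sentence) {
        $terms = preg_split('/\s+/u', mb_strtolower(trim($sentence)),
            -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($terms);
        $total = max(1, array_sum($counts));
        $normalized = [];
        foreach ($counts as $term => $count) {
            $normalized[$term] = $count / $total;
        }
        // one [terms, normalized term frequencies] pair per sentence
        $result[] = [array_keys($counts), $normalized];
    }
    return $result;
}
```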
formatDoc()
Formats the document to remove carriage returns, hyphens and digits, as we will not be using digits in the word cloud.
public
static formatDoc(string $content) : string
The formatted document generated by this function is only used to compute the centroid.
Parameters
- $content : string
-
the page content to be formatted.
Return values
string —formatted document.
formatSentence()
Formats the sentences to remove all characters except words, digits and spaces
public
static formatSentence(string $sentence) : string
Parameters
- $sentence : string
-
the sentence to format.
Return values
string —the formatted sentence.
getPunctuatedUnpunctuatedSentences()
Breaks any content into sentences with and without punctuation
public
static getPunctuatedUnpunctuatedSentences(object $dom, string $content, string $lang) : array<string|int, mixed>
Parameters
- $dom : object
-
a document object to extract a description from.
- $content : string
-
complete page.
- $lang : string
-
locale tag of the language of the data being processed
Return values
array<string|int, mixed> —array [sentences_with_punctuation, sentences_with_punctuation_stripped]
getSentenceRanks()
Compute the sentence ranks using the power method.
public
static getSentenceRanks(array<string|int, mixed> $adjacency) : array<string|int, mixed>
Takes the adjacency matrix and applies it 10 times to the starting sentence ranks, all of which are 1/n, where n is the number of sentences. We assume pr = A^{10}r approximates A^{11}r, so that A pr = pr; i.e., the pr vector is an eigenvector of A and its components approximate the importance of each sentence. After computing ranks in this way, we multiply the components of the resulting vector to slightly bias the results in favor of earlier sentences in the document, using a Zipf-like distribution on sentence order.
Parameters
- $adjacency : array<string|int, mixed>
-
the adjacency matrix (normalized to satisfy the conditions for the power method to converge) generated for the sentences
Return values
array<string|int, mixed> —the sentence ranks
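The following sketch spells out the iteration described above. The 1/(position + 1) factor used for the Zipf-like bias is an assumption made only for illustration; the bias the class actually applies may be gentler.
```php
<?php
// Sketch of power-method sentence ranking: start from uniform ranks
// 1/n, apply the column-normalized adjacency matrix 10 times, then
// bias the result toward earlier sentences.
function sketchSentenceRanks(array $adjacency): array
{
    $n = count($adjacency);
    if ($n == 0) {
        return [];
    }
    $ranks = array_fill(0, $n, 1 / $n);
    for ($iteration = 0; $iteration < 10; $iteration++) {
        $new_ranks = array_fill(0, $n, 0);
        for ($i = 0; $i < $n; $i++) {
            for ($j = 0; $j < $n; $j++) {
                $new_ranks[$i] += $adjacency[$i][$j] * $ranks[$j];
            }
        }
        $ranks = $new_ranks;
    }
    // Zipf-like bias toward earlier sentences (illustrative form only).
    for ($i = 0; $i < $n; $i++) {
        $ranks[$i] *= 1 / ($i + 1);
    }
    return $ranks;
}
```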
getSentences()
Breaks any content into sentences by splitting it on spaces or carriage returns
public
static getSentences(string $content) : array<string|int, mixed>
Parameters
- $content : string
-
complete page.
Return values
array<string|int, mixed> —array of sentences from that content.
getSummary()
This summarizer uses a page rank-like algorithm to find the important sentences in a document, generate a word cloud, and give scores for those sentences.
public
static getSummary(object $dom, string $page, string $lang) : array<string|int, mixed>
Parameters
- $dom : object
-
document object model of page to summarize
- $page : string
-
complete raw page to generate the summary from.
- $lang : string
-
locale tag of the page's language, used to decide which stop words to use and which language's tokenizer.php to call.
Return values
array<string|int, mixed> —a triple (string summary, array word cloud, array of position => scores for positions within the summary)
getSummaryFromSentenceScores()
Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of up to PageProcessor::$max_description_len characters based on the highest scored sentences, concatenated in the order they appeared in the original document.
public
static getSummaryFromSentenceScores(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentence_scores : array<string|int, mixed>
-
an array sorted by score of sentence_index => score pairs.
- $sentences : array<string|int, mixed>
-
the array of sentences corresponding to the indices in $sentence_scores
- $lang : string
-
locale tag of the page's language, used to decide which stop words to use and which language's tokenizer.php to call.
Return values
array<string|int, mixed> —a pair consisting of a string that represents the summary and a vector of (pos, score) pairs
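A sketch of the selection step follows: keep taking the highest-scoring sentences while a length budget allows, then emit the kept sentences in their original document order. The $max_len parameter stands in for PageProcessor::$max_description_len.
```php
<?php
// Sketch: pick top-scoring sentences subject to a character budget,
// then concatenate them in original document order. $max_len stands
// in for PageProcessor::$max_description_len.
function sketchSummaryFromSentenceScores(array $sentence_scores,
    array $sentences, int $max_len): array
{
    $chosen = [];
    $used = 0;
    // $sentence_scores is assumed sorted from highest to lowest score
    foreach ($sentence_scores as $index => $score) {
        $len = mb_strlen($sentences[$index]);
        if ($used + $len > $max_len) {
            break;
        }
        $chosen[$index] = $score;
        $used += $len;
    }
    ksort($chosen); // restore original document order
    $summary = "";
    $position_scores = [];
    foreach ($chosen as $index => $score) {
        $position_scores[] = [mb_strlen($summary), $score];
        $summary .= ($summary == "" ? "" : " ") . $sentences[$index];
    }
    return [$summary, $position_scores];
}
```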
getTermFrequencies()
Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
public
static getTermFrequencies(array<string|int, mixed> $terms, mixed $sentence_or_sentences) : array<string|int, mixed>
Parameters
- $terms : array<string|int, mixed>
-
the list of all terms in the doc
- $sentence_or_sentences : mixed
-
either a single string sentence or an array of sentences
Return values
array<string|int, mixed> —sequence of term => frequency pairs
getTermsFromSentences()
Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
public
static getTermsFromSentences(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the sentences in the doc
- $lang : string
-
locale tag for stemming
Return values
array<string|int, mixed> —an array of terms in the array of sentences
numSentencesForSummary()
Calculates how many sentences to put in the summary so that it fits within MAX_DESCRIPTION_LEN.
public
static numSentencesForSummary(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences) : int
Parameters
- $sentence_scores : array<string|int, mixed>
-
associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).
- $sentences : array<string|int, mixed>
-
sentences in doc in their original order
Return values
int —number of sentences
pageProcessing()
This function does additional processing on the page, such as removing all the tags from the page
public
static pageProcessing(string $page) : string
Parameters
- $page : string
-
complete page.
Return values
string —processed page.
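Since the description only names tag removal as one example, the sketch below shows that one step using PHP built-ins; the actual method may do more or different cleanup.
```php
<?php
// Sketch of one plausible preprocessing step, namely tag removal;
// the real pageProcessing() may perform additional cleanup.
function sketchPageProcessing(string $page): string
{
    $page = preg_replace('/<script\b[^>]*>.*?<\/script>/is', ' ', $page);
    $page = preg_replace('/<style\b[^>]*>.*?<\/style>/is', ' ', $page);
    return trim(preg_replace('/\s+/', ' ', strip_tags($page)));
}
```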
removePunctuation()
Remove punctuation from an array of sentences
public
static removePunctuation(array<string|int, mixed> $sentences) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the sentences in the doc
Return values
array<string|int, mixed> —the array of sentences with the punctuation removed
removeStopWords()
Returns a new array of sentences without the stop words
public
static removeStopWords(array<string|int, mixed> $sentences, object $stop_obj) : array<string|int, mixed>
Parameters
- $sentences : array<string|int, mixed>
-
the array of sentences to process
- $stop_obj : object
-
the class that has the stopwordRemover method
Return values
array<string|int, mixed> —a new array of sentences without the stop words
wordCloudFromSummary()
Generates an array of most important words from a string $summary.
public
static wordCloudFromSummary(string $summary, string $lang[, array<string|int, mixed> $term_frequencies = null ]) : array<string|int, mixed>
Currently, the algorithm is based on term frequencies after stop words are removed
Parameters
- $summary : string
-
text to derive most important words of
- $lang : string
-
locale tag for language of $summary
- $term_frequencies : array<string|int, mixed> = null
-
a supplied list of terms and frequencies for words in the summary. If null, then these will be computed.
Return values
array<string|int, mixed> —the self::WORD_CLOUD_LEN most important terms in $summary
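In the same spirit, here is a sketch that tallies term frequencies and keeps the most frequent terms; stop-word removal, which the real method performs via the locale's tokenizer, is represented here by a caller-supplied $stop_words list.
```php
<?php
// Sketch: rank the terms of $summary by frequency and keep the top
// $cloud_len of them. The $stop_words list stands in for the locale's
// stop-word removal in the real method.
function sketchWordCloudFromSummary(string $summary,
    array $stop_words = [], int $cloud_len = 5): array
{
    $terms = preg_split('/\W+/u', mb_strtolower($summary), -1,
        PREG_SPLIT_NO_EMPTY);
    $terms = array_diff($terms, $stop_words);
    $frequencies = array_count_values($terms);
    arsort($frequencies); // most frequent terms first
    return array_slice(array_keys($frequencies), 0, $cloud_len);
}
```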
wordCloudFromTermVector()
Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms
public
static wordCloudFromTermVector(array<string|int, mixed> $term_vector[, mixed $terms = false ]) : array<string|int, mixed>
Parameters
- $term_vector : array<string|int, mixed>
-
if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.
- $terms : mixed = false
-
if not false, this should be an array of terms that, at a minimum, contains all the indices used in $term_vector
Return values
array<string|int, mixed> —the self::WORD_CLOUD_LEN most important terms in $term_vector
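To make the two accepted input shapes concrete, here is a sketch: when $terms is false the keys of $term_vector are themselves the terms; otherwise the keys index into $terms. The vector is assumed already sorted by decreasing weight, as stated above.
```php
<?php
// Sketch of reading a word cloud off a weight-sorted term vector.
function sketchWordCloudFromTermVector(array $term_vector,
    $terms = false, int $cloud_len = 5): array
{
    $cloud = [];
    foreach ($term_vector as $key => $weight) {
        if (count($cloud) >= $cloud_len) {
            break;
        }
        // keys are terms when $terms is false, indices into $terms otherwise
        $cloud[] = ($terms === false) ? $key : $terms[$key];
    }
    return $cloud;
}
```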