Yioop_V9.5_Source_Code_Documentation

CentroidWeightedSummarizer extends Summarizer

Class which may be used by TextProcessors to get a summary for a text document that may later be used for indexing. This is done by the getSummary() method. To generate a summary, a normalized term frequency vector is computed for each sentence. An average vector is then computed by summing these and renormalizing the result.

The computation of this average vector is biased by weighting the vectors of earlier sentences more heavily when computing the sum of vectors. This is done using weights coming from a Zipf-like distribution. Once an average sentence is obtained, sentences are scored against it using a residual cosine similarity score. That is, the most important sentence is determined by cosine rank. Then the components of this sentence in the direction of the average sentence are deleted from the average sentence, and the next most important sentence is computed by ranking against this new average sentence vector, and so on.
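The residual ranking loop described above can be sketched as follows. This is an illustrative Python sketch, not Yioop's PHP implementation: the helper names (`dot`, `normalize`, `residual_rank`) are hypothetical, vectors are assumed to be sparse term => weight maps of unit length, and the "deletion" step is read here as subtracting from the running average its projection onto the chosen sentence, which is one possible interpretation of the description.

```python
import math

def dot(u, v):
    """Inner product of two sparse vectors stored as term => weight dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def normalize(v):
    """Scale a sparse vector to unit length; a zero vector is copied unchanged."""
    norm = math.sqrt(dot(v, v))
    return dict(v) if norm == 0 else {t: w / norm for t, w in v.items()}

def residual_rank(sentence_vectors, average):
    """Order sentence indices by repeated ranking against a shrinking average."""
    residual = dict(average)
    remaining = set(range(len(sentence_vectors)))
    order = []
    while remaining:
        # Sentence vectors are unit length, so the inner product serves as the score.
        best = max(sorted(remaining),
                   key=lambda i: dot(sentence_vectors[i], residual))
        order.append(best)
        remaining.remove(best)
        # Assumed deletion step: remove from the residual its component
        # along the sentence that was just selected.
        proj = dot(sentence_vectors[best], residual)
        for term, weight in sentence_vectors[best].items():
            residual[term] = residual.get(term, 0.0) - proj * weight
    return order
```

Under this reading, a sentence that duplicates an already selected one scores close to zero on later rounds, so the ranking favors sentences that cover different parts of the average vector.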

Tags
author

Charles Bocage (charles.bocage@sjsu.edu), rewritten by Chris Pollett (chris@pollett.org)

Table of Contents

CENTROID_COMPONENTS  = 1000
Number of nonzero centroid components
MAX_DISTINCT_TERMS  = 1000
Number of distinct terms to use in generating summary
WORD_CLOUD_LEN  = 5
Number of words in word cloud
computeTermFrequenciesPerSentence()  : array<string|int, mixed>
Splits sentences into terms and returns [array of terms, array of normalized term frequencies]
formatDoc()  : string
Formats the document by removing carriage returns, hyphens, and digits, since digits are not used in the word cloud.
formatSentence()  : string
Formats the sentences to remove all characters except words, digits and spaces
getAverageSentence()  : array<string|int, mixed>
Computes an average sentence by adding the normalized term frequency vectors for each sentence, weighted by a Zipf-like distribution on sentence index, and normalizing the resulting vector.
getPunctuatedUnpunctuatedSentences()  : array<string|int, mixed>
Breaks any content into sentences with and without punctuation
getSentences()  : array<string|int, mixed>
Breaks any content into sentences by splitting it on spaces or carriage returns
getSummary()  : array<string|int, mixed>
Generates a summary, word cloud, and summary scores based on the closeness of normalized term frequency vectors to an average term frequency vector for sentences.
getSummaryFromSentenceScores()  : array<string|int, mixed>
Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of length up to PageProcessor::$max_description_len based on the highest scored sentences, concatenated in the order they appeared in the original document.
getTermFrequencies()  : array<string|int, mixed>
Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences
getTermsFromSentences()  : array<string|int, mixed>
Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.
numSentencesForSummary()  : int
Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.
pageProcessing()  : string
Performs additional processing on the page, such as removing all tags
removePunctuation()  : array<string|int, mixed>
Remove punctuation from an array of sentences
removeStopWords()  : array<string|int, mixed>
Returns a new array of sentences without the stop words
scoreSentencesVersusAverage()  : array<string|int, mixed>
Computes scores for each sentence => word vector in an array of sentence => word_vectors based on how it compares with an average sentence word vector. Here word vectors are normalized vectors and scores are determined by inner product.
wordCloudFromSummary()  : array<string|int, mixed>
Generates an array of most important words from a string $summary.
wordCloudFromTermVector()  : array<string|int, mixed>
Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms

Constants

CENTROID_COMPONENTS

Number of nonzero centroid components

public mixed CENTROID_COMPONENTS = 1000

MAX_DISTINCT_TERMS

Number of distinct terms to use in generating summary

public mixed MAX_DISTINCT_TERMS = 1000

WORD_CLOUD_LEN

Number of words in word cloud

public mixed WORD_CLOUD_LEN = 5

Methods

computeTermFrequenciesPerSentence()

Splits sentences into terms and returns [array of terms, array of normalized term frequencies]

public static computeTermFrequenciesPerSentence(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
$sentences : array<string|int, mixed>

the array of sentences to process

$lang : string

the current locale

Return values
array<string|int, mixed>

an array with [array of terms, array of normalized term frequencies] pairs
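As a rough illustration of the return shape, the following Python sketch builds a term list and one normalized frequency vector per sentence. It is not the Yioop code: the function name is hypothetical, tokenization is plain whitespace splitting with no stemming or locale handling, and L2 (unit length) normalization is assumed because the scores are later described as cosine-based.

```python
import math
from collections import Counter

def term_frequencies_per_sentence(sentences):
    """Return (terms, vectors); vectors[i] maps term index => normalized frequency."""
    # Assumption: lowercase whitespace tokenization stands in for the real tokenizer.
    counts = [Counter(s.lower().split()) for s in sentences]
    terms = sorted({t for c in counts for t in c})
    index = {t: i for i, t in enumerate(terms)}
    vectors = []
    for c in counts:
        # Assumption: each sentence vector is scaled to unit Euclidean length.
        norm = math.sqrt(sum(n * n for n in c.values()))
        vectors.append({index[t]: n / norm for t, n in c.items()} if norm else {})
    return terms, vectors
```

Storing only nonzero entries keyed by term index keeps each sentence vector small even when the document has many distinct terms.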

formatDoc()

Formats the document by removing carriage returns, hyphens, and digits, since digits are not used in the word cloud.

public static formatDoc(string $content) : string

The formatted document generated by this function is only used to compute centroid.

Parameters
$content : string

formatted page.

Return values
string

formatted document.

formatSentence()

Formats the sentences to remove all characters except words, digits and spaces

public static formatSentence(string $sentence) : string
Parameters
$sentence : string

the sentence to format.

Return values
string

formatted sentences.

getAverageSentence()

Computes an average sentence by adding the normalized term frequency vectors for each sentence, weighted by a Zipf-like distribution on sentence index, and normalizing the resulting vector.

public static getAverageSentence(array<string|int, mixed> $term_frequencies_normalized) : array<string|int, mixed>
Parameters
$term_frequencies_normalized : array<string|int, mixed>

the array of per-sentence vectors, each with terms as keys and normalized frequencies as values

Return values
array<string|int, mixed>

a normalized vector of term => weights
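A minimal Python sketch of such a position-weighted average follows. The exact weights Yioop uses are not stated on this page, so the sketch assumes the simplest Zipf-like choice, a weight of 1/(i + 1) for the sentence at index i, and L2 normalization of the result; the function name is hypothetical.

```python
import math

def average_sentence(sentence_vectors):
    """Weighted sum of sparse sentence vectors, rescaled to unit length."""
    total = {}
    for i, vec in enumerate(sentence_vectors):
        weight = 1.0 / (i + 1)  # assumed Zipf-like weight: earlier sentences count more
        for term, value in vec.items():
            total[term] = total.get(term, 0.0) + weight * value
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {t: v / norm for t, v in total.items()} if norm else total
```

With two single-term sentences, the term from the first sentence ends up with twice the weight of the term from the second, which is the intended bias toward the start of the document.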

getPunctuatedUnpunctuatedSentences()

Breaks any content into sentences with and without punctuation

public static getPunctuatedUnpunctuatedSentences(object $dom, string $content, string $lang) : array<string|int, mixed>
Parameters
$dom : object

a document object to extract a description from.

$content : string

complete page.

$lang : string

locale tag of the language for data being processed

Return values
array<string|int, mixed>

array [sentences_with_punctuation, sentences_with_punctuation_stripped]

getSentences()

Breaks any content into sentences by splitting it on spaces or carriage returns

public static getSentences(string $content) : array<string|int, mixed>
Parameters
$content : string

complete page.

Return values
array<string|int, mixed>

array of sentences from that content.

getSummary()

Generates a summary, word cloud, and summary scores based on the closeness of normalized term frequency vectors to an average term frequency vector for sentences.

public static getSummary(object $dom, string $page, string $lang) : array<string|int, mixed>
Parameters
$dom : object

document object model of page to summarize

$page : string

complete raw page to generate the summary from.

$lang : string

language of the page, used to decide which stop words to apply and which language's tokenizer.php to call.

Return values
array<string|int, mixed>

a triple (string summary, array word cloud, array of position => scores for positions within the summary)

getSummaryFromSentenceScores()

Given a score-sorted array of sentence index => score pairs and a set of sentences, outputs a summary of length up to PageProcessor::$max_description_len based on the highest scored sentences, concatenated in the order they appeared in the original document.

public static getSummaryFromSentenceScores(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
$sentence_scores : array<string|int, mixed>

an array sorted by score of sentence_index => score pairs.

$sentences : array<string|int, mixed>

the array of sentences corresponding to sentence $sentence_scores indices

$lang : string

language of the page, used to decide which stop words to apply and which language's tokenizer.php to call.

Return values
array<string|int, mixed>

a pair consisting of a string that represents the summary and a vector of (pos, score) pairs
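The selection and reordering step might look like the Python sketch below. It is a sketch under stated assumptions rather than the actual method: the length budget is passed in as a hypothetical `max_len` parameter measured in characters, sentences are joined with a single space, and each reported position is taken to be the character offset at which a sentence starts in the summary.

```python
def summary_from_scores(sentence_scores, sentences, max_len):
    """Pick top-scored sentences within max_len, then emit them in document order."""
    chosen, used = [], 0
    # dict iteration follows insertion order, assumed here to be score-sorted
    for index in sentence_scores:
        cost = len(sentences[index]) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost > max_len:
            break
        chosen.append(index)
        used += cost
    summary, positions = "", []
    for index in sorted(chosen):  # restore original document order
        if summary:
            summary += " "
        positions.append((len(summary), sentence_scores[index]))
        summary += sentences[index]
    return summary, positions
```

Choosing by score but emitting by original index keeps the summary readable, since the selected sentences appear in the same order as in the source text.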

getTermFrequencies()

Calculates an array whose keys are terms and whose values are their frequencies, based on a supplied sentence or sentences

public static getTermFrequencies(array<string|int, mixed> $terms, mixed $sentence_or_sentences) : array<string|int, mixed>
Parameters
$terms : array<string|int, mixed>

the list of all terms in the doc

$sentence_or_sentences : mixed

either a single string sentence or an array of sentences

Return values
array<string|int, mixed>

sequence of term => frequency pairs

getTermsFromSentences()

Get up to the top self::MAX_DISTINCT_TERMS terms from an array of sentences in order of term frequency.

public static getTermsFromSentences(array<string|int, mixed> $sentences, string $lang) : array<string|int, mixed>
Parameters
$sentences : array<string|int, mixed>

the sentences in the doc

$lang : string

locale tag for stemming

Return values
array<string|int, mixed>

an array of terms in the array of sentences
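For illustration, the frequency-ordered term selection can be sketched in Python as below. The function name is hypothetical, and the sketch assumes lowercase whitespace tokenization; the real method takes a locale tag for stemming, which is omitted here.

```python
from collections import Counter

MAX_DISTINCT_TERMS = 1000  # mirrors the class constant of the same name

def terms_from_sentences(sentences, max_terms=MAX_DISTINCT_TERMS):
    """Return up to max_terms distinct terms, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: no stemming; a term is a lowercase whitespace-delimited token.
        counts.update(sentence.lower().split())
    # most_common sorts by descending count; equal counts keep first-seen order
    return [term for term, _ in counts.most_common(max_terms)]
```

Capping the vocabulary bounds the size of every sentence vector, which in turn bounds the cost of the later inner products.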

numSentencesForSummary()

Calculates how many sentences to put in the summary to match the MAX_DESCRIPTION_LEN.

public static numSentencesForSummary(array<string|int, mixed> $sentence_scores, array<string|int, mixed> $sentences) : int
Parameters
$sentence_scores : array<string|int, mixed>

associative array of sentence-number-in-doc => similarity score to centroid (sorted from highest to lowest score).

$sentences : array<string|int, mixed>

sentences in doc in their original order

Return values
int

number of sentences

pageProcessing()

Performs additional processing on the page, such as removing all tags

public static pageProcessing(string $page) : string
Parameters
$page : string

complete page.

Return values
string

processed page.

removePunctuation()

Remove punctuation from an array of sentences

public static removePunctuation(array<string|int, mixed> $sentences) : array<string|int, mixed>
Parameters
$sentences : array<string|int, mixed>

the sentences in the doc

Return values
array<string|int, mixed>

the array of sentences with the punctuation removed

removeStopWords()

Returns a new array of sentences without the stop words

public static removeStopWords(array<string|int, mixed> $sentences, object $stop_obj) : array<string|int, mixed>
Parameters
$sentences : array<string|int, mixed>

the array of sentences to process

$stop_obj : object

the class that has the stopwordsRemover method

Return values
array<string|int, mixed>

a new array of sentences without the stop words

scoreSentencesVersusAverage()

Computes scores for each sentence => word vector in an array of sentence => word_vectors based on how it compares with an average sentence word vector. Here word vectors are normalized vectors and scores are determined by inner product.

public static scoreSentencesVersusAverage(array<string|int, mixed> $sentence_vectors, array<string|int, mixed> $average_sentence) : array<string|int, mixed>
Parameters
$sentence_vectors : array<string|int, mixed>

the array of sentence word vectors, each with terms as keys and normalized frequencies as values

$average_sentence : array<string|int, mixed>

an array of each word's average frequency value

Return values
array<string|int, mixed>

array of sentence index => score pairs
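Because both kinds of vector are normalized, the inner product equals the cosine similarity. A small Python sketch of this scoring step is given below; the function name is hypothetical, vectors are assumed to be sparse term => weight maps, and returning the scores sorted from highest to lowest is an assumption made so the result can feed a score-sorted consumer.

```python
def score_sentences(sentence_vectors, average_sentence):
    """Map sentence index => inner product with the average, highest score first."""
    scores = {}
    for index, vec in enumerate(sentence_vectors):
        # Only terms present in the sentence can contribute to the inner product.
        scores[index] = sum(w * average_sentence.get(t, 0.0) for t, w in vec.items())
    # Assumption: callers want the pairs ordered by descending score.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Iterating over the sentence's own terms, rather than over the average vector, keeps the cost proportional to sentence length.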

wordCloudFromSummary()

Generates an array of most important words from a string $summary.

public static wordCloudFromSummary(string $summary, string $lang[, array<string|int, mixed> $term_frequencies = null ]) : array<string|int, mixed>

Currently, the algorithm is based on term frequencies after stop words are removed.

Parameters
$summary : string

text to derive most important words of

$lang : string

locale tag for language of $summary

$term_frequencies : array<string|int, mixed> = null

a supplied list of terms and frequencies for words in summary. If null then these will be computed.

Return values
array<string|int, mixed>

the top self::WORD_CLOUD_LEN most important terms in $summary

wordCloudFromTermVector()

Given a sorted term vector for a document, computes a word cloud of the self::WORD_CLOUD_LEN most important terms

public static wordCloudFromTermVector(array<string|int, mixed> $term_vector[, mixed $terms = false ]) : array<string|int, mixed>
Parameters
$term_vector : array<string|int, mixed>

if $terms is false, then $term_vector is expected to be a sequence of term => weight pairs; otherwise, if $terms is an array of terms, then $term_vector should be a sequence of term_index => weight pairs.

$terms : mixed = false

if not false, then should be an array of terms, at a minimum having all the indices of $term_vector

Return values
array<string|int, mixed>

the top self::WORD_CLOUD_LEN most important terms in $term_vector
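The two accepted input shapes can be illustrated with the following Python sketch. It is not the Yioop implementation: the function name is hypothetical, `None` stands in for PHP's `false` default, and the input is assumed to be already sorted from most to least important.

```python
WORD_CLOUD_LEN = 5  # mirrors the class constant of the same name

def word_cloud_from_term_vector(term_vector, terms=None, length=WORD_CLOUD_LEN):
    """Return the first `length` terms of a weight-sorted term vector."""
    keys = list(term_vector)[:length]  # assumed pre-sorted by descending weight
    if terms is None:
        return keys                    # keys are already the terms themselves
    return [terms[k] for k in keys]    # keys are indices into the terms array
```

For example, `word_cloud_from_term_vector({2: 0.9, 0: 0.5}, terms=["a", "b", "c"])` looks up indices 2 and 0 and yields the terms "c" and "a" in that order.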


        
