Yioop V9.5 Source Code Documentation

WikiParser
implements CrawlConstants

Class with methods to parse mediawiki documents, both within Yioop and when Yioop indexes mediawiki dumps such as those from Wikipedia.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$base_address  : string
Base url address to be used in urls that occur in wiki substitutions
$braces_matches  : array<string|int, mixed>
Regex patterns for wiki syntax involving braces
$braces_replaces  : array<string|int, mixed>
HTML substitutions for the wiki syntax given in $braces_matches
$esc  : string
Escape string used to try to prevent incorrect nesting of div tags in some of the substitutions
$link_matches  : array<string|int, mixed>
Regex patterns for wiki syntax involving links
$link_replaces  : array<string|int, mixed>
HTML substitutions for the wiki syntax given in $link_matches
$matches  : array<string|int, mixed>
Regex patterns for common wiki syntax
$minimal  : bool
Whether the parser should be configured only to do minimal substitutions or all available (minimal might be used for posts in discussion groups)
$replaces  : array<string|int, mixed>
HTML substitutions for the wiki syntax given in $matches
__construct()  : mixed
Initializes the arrays of matches and replacements used to convert mediawiki syntax into HTML (imperfectly, since only regexes are used)
cleanLinksAndParagraphs()  : string
Replaces spaces in links with underscores and fixes newline issues within span tags
fetchLinks()  : array<string|int, mixed>
Fetches internal links from wiki syntax.
insertReferences()  : string
After regex processing has been done on a wiki page, this function inserts the reference list into the resulting page at the {{reflist locations, then returns the resulting page
insertTableOfContents()  : string
After regex processing has been done on a wiki page this function inserts into the resulting page a table of contents just before the first h2 tag, then returns the result page
makeReferences()  : string
Used to make a reference list for a wiki page based on the cite tags on that page.
makeTableOfContents()  : string
Used to make a table of contents for a wiki page based on the level two headings on that page.
parse()  : string
Parses a mediawiki document to produce an HTML equivalent
processProvidedRegexes()  : string
Applies a set of transformations from wiki syntax to html to a document
processRegexes()  : string
Applies all the wiki substitutions of this WikiParser to the document to create an html document. Makes use of processProvidedRegexes().

Properties

$base_address

Base url address to be used in urls that occur in wiki substitutions

public string $base_address

$braces_matches

Regex patterns for wiki syntax involving braces

public array<string|int, mixed> $braces_matches

$braces_replaces

HTML substitutions for the wiki syntax given in $braces_matches

public array<string|int, mixed> $braces_replaces

$esc

Escape string used to try to prevent incorrect nesting of div tags in some of the substitutions

public string $esc = ",[}"

$link_matches

Regex patterns for wiki syntax involving links

public array<string|int, mixed> $link_matches

$link_replaces

HTML substitutions for the wiki syntax given in $link_matches

public array<string|int, mixed> $link_replaces

$matches

Regex patterns for common wiki syntax

public array<string|int, mixed> $matches

$minimal

Whether the parser should be configured only to do minimal substitutions or all available (minimal might be used for posts in discussion groups)

public bool $minimal

$replaces

HTML substitutions for the wiki syntax given in $matches

public array<string|int, mixed> $replaces

Methods

__construct()

Initializes the arrays of matches and replacements used to convert mediawiki syntax into HTML (imperfectly, since only regexes are used)

public __construct([string $base_address = "" ][, array<string|int, mixed> $add_substitutions = [] ][, bool $minimal = false ]) : mixed
Parameters
$base_address : string = ""

base url for link substitutions

$add_substitutions : array<string|int, mixed> = []

wiki rule substitutions that this wiki parser should use in addition to the default ones

$minimal : bool = false

if true, a shorter substitution list is used, suitable for posts in discussion groups

Return values
mixed

cleanLinksAndParagraphs()

Replaces spaces in links with underscores and fixes newline issues within span tags

public cleanLinksAndParagraphs(string $document) : string
Parameters
$document : string

wiki document to fix

Return values
string

document after substitutions

fetchLinks()

Fetches internal links from wiki syntax.

public fetchLinks(array<string|int, mixed> $document) : array<string|int, mixed>
Parameters
$document : array<string|int, mixed>

a wiki document

Return values
array<string|int, mixed>

array of linked page names in the format page_name|relationship_type
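
As an illustration only, extracting internal link targets with a regex might look like the sketch below. It is not Yioop's implementation: the function name sketchFetchLinks(), the string input, and the underscore normalization are assumptions, and the sketch returns bare page names without the relationship_type part.

```php
<?php
/**
 * Hypothetical helper: collects the targets of [[...]] internal links.
 * Unlike the documented method, it returns bare page names and does not
 * attach a relationship type.
 */
function sketchFetchLinks(string $document): array
{
    // capture the text after [[ up to the first |, #, or closing bracket
    preg_match_all('/\[\[([^\[\]|#]+)/', $document, $found);
    $names = [];
    foreach ($found[1] as $target) {
        // assumed normalization: trim and use underscores for spaces
        $names[] = str_replace(" ", "_", trim($target));
    }
    // drop duplicates and reindex from 0
    return array_values(array_unique($names));
}

// Yields the two names Web_crawler and PHP
print_r(sketchFetchLinks("See [[Web crawler|crawlers]] and [[PHP]]."));
```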

insertReferences()

After regex processing has been done on a wiki page, this function inserts the reference list into the resulting page at the {{reflist locations, then returns the resulting page

public insertReferences(string $page, string $references) : string
Parameters
$page : string

page in which to insert the reference lists

$references : string

HTML reference list

Return values
string

resulting page after insert
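
A minimal sketch of this kind of insertion follows. It assumes the marker left in the page has the form {{reflist ...}}, which the description above does not spell out, and the function name sketchInsertReferences() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: replaces every {{reflist ...}} marker in $page
 * with the HTML in $references. The exact marker form is an assumption.
 */
function sketchInsertReferences(string $page, string $references): string
{
    // a callback is used so that text such as $1 inside $references
    // is inserted literally rather than read as a back-reference
    $result = preg_replace_callback('/\{\{reflist[^}]*\}\}/i',
        function ($m) use ($references) {
            return $references;
        }, $page);
    // preg_replace_callback() returns null on error; keep the input then
    return $result ?? $page;
}

echo sketchInsertReferences("<p>Text</p>{{reflist|2}}",
    "<ol><li>Ref one</li></ol>");
```

Using a callback rather than a plain replacement string is a deliberate choice here, since a generated reference list may contain dollar signs or backslashes.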

insertTableOfContents()

After regex processing has been done on a wiki page this function inserts into the resulting page a table of contents just before the first h2 tag, then returns the result page

public insertTableOfContents(string $page, string $toc) : string
Parameters
$page : string

page in which to insert table of contents

$toc : string

HTML table of contents

Return values
string

resulting page after insert
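
The insertion step described above can be sketched as follows. This is a standalone illustration rather than Yioop's code: the name insertBeforeFirstH2() is hypothetical, and returning the page unchanged when it has no h2 tag is an assumption.

```php
<?php
/**
 * Hypothetical helper: inserts $toc just before the first <h2 tag of $page.
 * If the page has no h2 tag, the page is returned unchanged (an assumption).
 */
function insertBeforeFirstH2(string $page, string $toc): string
{
    // case-insensitive search for the start of the first h2 tag
    $pos = stripos($page, "<h2");
    if ($pos === false) {
        return $page;
    }
    return substr($page, 0, $pos) . $toc . substr($page, $pos);
}

$page = "<p>Intro</p><h2>History</h2><p>More</p><h2>Usage</h2>";
echo insertBeforeFirstH2($page, "<div class='toc'>TOC</div>");
```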

makeReferences()

Used to make a reference list for a wiki page based on the cite tags on that page.

public makeReferences(string $page) : string
Parameters
$page : string

a wiki document

Return values
string

HTML reference list to be inserted after wiki page processed

makeTableOfContents()

Used to make a table of contents for a wiki page based on the level two headings on that page.

public makeTableOfContents(string $page) : string
Parameters
$page : string

a wiki document

Return values
string

HTML table of contents to be inserted after wiki page processed
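
One way to build such a table of contents is sketched below, assuming level two headings are written as ==Heading== on their own line. The function name sketchTableOfContents(), the ordered-list markup, and the anchor format are all assumptions, not Yioop's actual output.

```php
<?php
/**
 * Hypothetical helper: builds an HTML list of links from the level two
 * headings (==Heading==) of a wiki document.
 */
function sketchTableOfContents(string $page): string
{
    // match lines of the form ==Heading== but not ===Sub-heading===
    preg_match_all('/^==(?!=)[ \t]*(.+?)[ \t]*==(?!=)[ \t]*$/m',
        $page, $found);
    if (empty($found[1])) {
        return "";
    }
    $toc = "<ol>\n";
    foreach ($found[1] as $heading) {
        // anchor format is an assumption: spaces become underscores
        $anchor = str_replace(" ", "_", $heading);
        $toc .= "<li><a href=\"#" . htmlspecialchars($anchor) . "\">" .
            htmlspecialchars($heading) . "</a></li>\n";
    }
    return $toc . "</ol>\n";
}

echo sketchTableOfContents("Intro\n==History==\ntext\n===Detail===\n");
```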

parse()

Parses a mediawiki document to produce an HTML equivalent

public parse(string $document[, bool $parse_head_vars = true ][, bool $handle_big_files = false ]) : string
Parameters
$document : string

a document which might have mediawiki markup

$parse_head_vars : bool = true

header variables are an extension of mediawiki syntax used to add meta variables and titles to the head tag of an html document. This flag controls whether this extension is supported

$handle_big_files : bool = false

for indexing purposes, Yioop by default truncates long documents before indexing them. If true, this method skips that default truncation. The true value is more useful when using Yioop's built-in wiki.

Return values
string

HTML document obtained by parsing mediawiki markup in $document

processProvidedRegexes()

Applies a set of transformations from wiki syntax to html to a document

public processProvidedRegexes(array<string|int, mixed> $matches, array<string|int, mixed> $replaces, string $document) : string
Parameters
$matches : array<string|int, mixed>

an array of regex patterns to match

$replaces : array<string|int, mixed>

what to replace matches with

$document : string

wiki document to fix

Return values
string

document after substitutions
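
The general technique can be sketched as below, assuming the patterns and replacements are parallel arrays that can be handed to PHP's preg_replace(). This is not Yioop's implementation: the function name applyWikiRegexes() and the two sample rules are hypothetical.

```php
<?php
/**
 * Hypothetical helper: applies parallel arrays of regex patterns and
 * replacements to a document, in array order.
 */
function applyWikiRegexes(array $matches, array $replaces,
    string $document): string
{
    // preg_replace() accepts arrays and applies each pattern in turn
    $result = preg_replace($matches, $replaces, $document);
    // on a regex error preg_replace() returns null; fall back to the input
    return $result ?? $document;
}

// Two illustrative rules (assumed, not taken from Yioop's rule set);
// the bold rule comes first so ''' is not consumed by the italic rule
$matches = ["/'''(.+?)'''/", "/''(.+?)''/"];
$replaces = ['<b>$1</b>', '<i>$1</i>'];
// prints: <b>bold</b> and <i>italic</i>
echo applyWikiRegexes($matches, $replaces, "'''bold''' and ''italic''");
```

Rule order matters in a scheme like this, because each pattern sees the output of the ones before it.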

processRegexes()

Applies all the wiki substitutions of this WikiParser to the document to create an html document. Makes use of processProvidedRegexes().

public processRegexes(string $document) : string
Parameters
$document : string

a document with wiki syntax

Return values
string

result of substitutions to make html


        
