WikiParser
in package
implements
CrawlConstants
Class with methods to parse mediawiki documents, both within Yioop and when Yioop indexes mediawiki dumps such as those from Wikipedia.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $base_address : string
- Base url address to be used in urls that occur in wiki substitutions
- $braces_matches : array<string|int, mixed>
- Regex patterns for wiki syntax involving braces
- $braces_replaces : array<string|int, mixed>
- HTML substitutions for the wiki syntax given in $braces_matches
- $esc : string
- Escape string used to try to prevent incorrect nesting of div tags for some of the substitutions
- $link_matches : array<string|int, mixed>
- Regex patterns for wiki syntax involving links
- $link_replaces : array<string|int, mixed>
- HTML substitutions for the wiki syntax given in $link_matches
- $matches : array<string|int, mixed>
- Regex patterns for common wiki syntax
- $minimal : bool
- Whether the parser should be configured only to do minimal substitutions or all available (minimal might be used for posts in discussion groups)
- $replaces : array<string|int, mixed>
- HTML substitutions for the wiki syntax given in $matches
- __construct() : mixed
- Initializes the arrays of match/replacement patterns used to format mediawiki syntax into HTML (imperfectly, since only regexes are used)
- cleanLinksAndParagraphs() : string
- Replaces spaces in links with underscores and fixes newline issues within span tags
- fetchLinks() : array<string|int, mixed>
- Fetches internal links from wiki syntax.
- insertReferences() : string
- After regex processing has been done on a wiki page, this function inserts the reference list into the resulting page at {{reflist locations, then returns the resulting page
- insertTableOfContents() : string
- After regex processing has been done on a wiki page this function inserts into the resulting page a table of contents just before the first h2 tag, then returns the result page
- makeReferences() : string
- Used to make a reference list for a wiki page based on the cite tags on that page.
- makeTableOfContents() : string
- Used to make a table of contents for a wiki page based on the level two headings on that page.
- parse() : string
- Parses a mediawiki document to produce an HTML equivalent
- processProvidedRegexes() : string
- Applies a set of transformations from wiki syntax to html to a document
- processRegexes() : string
- Applies all the wiki substitutions of this WikiParser to the document to create an html document. Makes use of processProvidedRegexes()
Properties
$base_address
Base url address to be used in urls that occur in wiki substitutions
public
string
$base_address
$braces_matches
Regex patterns for wiki syntax involving braces
public
array<string|int, mixed>
$braces_matches
$braces_replaces
HTML substitutions for the wiki syntax given in $braces_matches
public
array<string|int, mixed>
$braces_replaces
$esc
Escape string used to try to prevent incorrect nesting of div tags for some of the substitutions
public
string
$esc
= ",[}"
$link_matches
Regex patterns for wiki syntax involving links
public
array<string|int, mixed>
$link_matches
$link_replaces
HTML substitutions for the wiki syntax given in $link_matches
public
array<string|int, mixed>
$link_replaces
$matches
Regex patterns for common wiki syntax
public
array<string|int, mixed>
$matches
$minimal
Whether the parser should be configured only to do minimal substitutions or all available (minimal might be used for posts in discussion groups)
public
bool
$minimal
$replaces
HTML substitutions for the wiki syntax given in $matches
public
array<string|int, mixed>
$replaces
Methods
__construct()
Initializes the arrays of match/replacement patterns used to format mediawiki syntax into HTML (imperfectly, since only regexes are used)
public
__construct([string $base_address = "" ][, array<string|int, mixed> $add_substitutions = [] ][, bool $minimal = false ]) : mixed
Parameters
- $base_address : string = ""
-
base url for link substitutions
- $add_substitutions : array<string|int, mixed> = []
-
wiki rule substitutions to be used by this wiki parser in addition to the default ones
- $minimal : bool = false
-
if true, a shorter substitution list is used, suitable for posts in discussion groups
Return values
mixed
cleanLinksAndParagraphs()
Replaces spaces in links with underscores and fixes newline issues within span tags
public
cleanLinksAndParagraphs(string $document) : string
Parameters
- $document : string
-
wiki document to fix
Return values
string —document after substitutions
fetchLinks()
Fetches internal links from wiki syntax.
public
fetchLinks(array<string|int, mixed> $document) : array<string|int, mixed>
Parameters
- $document : array<string|int, mixed>
-
a wiki document
Return values
array<string|int, mixed> —of linked page names in the format page_name|relationship_type
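The link extraction described above can be illustrated with a short regex-based sketch. This is not Yioop's implementation: the function name `sketchFetchInternalLinks` is hypothetical, the sketch takes a string rather than an array for simplicity, and the relationship type part of the documented return format is not modeled.

```php
<?php
/**
 * Hypothetical sketch: collects the targets of [[...]] links in a wiki
 * document. Link text after a | is dropped, and the relationship type
 * mentioned in the documentation is not modeled here.
 */
function sketchFetchInternalLinks(string $document): array
{
    // Capture everything between [[ and the first | or ]]
    preg_match_all('/\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]/', $document, $found);
    $links = [];
    foreach ($found[1] as $target) {
        $target = trim($target);
        // Skip targets that look like external URLs (scheme://...)
        if ($target !== "" &&
            !preg_match('/^[a-z][a-z0-9+.-]*:\/\//i', $target)) {
            // Assumed convention: page names use underscores for spaces
            $links[] = str_replace(" ", "_", $target);
        }
    }
    // array_values re-indexes after array_unique removes duplicates
    return array_values(array_unique($links));
}

$links = sketchFetchInternalLinks(
    "See [[Main Page]] and [[Help|the help page]], then [[Main Page]] again."
);
```

Here `$links` holds each distinct target once, in order of first appearance.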
insertReferences()
After regex processing has been done on a wiki page, this function inserts the reference list into the resulting page at {{reflist locations, then returns the resulting page
public
insertReferences(string $page, string $references) : string
Parameters
- $page : string
-
page in which to insert the reference lists
- $references : string
-
HTML reference list
Return values
string —resulting page after insert
insertTableOfContents()
After regex processing has been done on a wiki page this function inserts into the resulting page a table of contents just before the first h2 tag, then returns the result page
public
insertTableOfContents(string $page, string $toc) : string
Parameters
- $page : string
-
page in which to insert table of contents
- $toc : string
-
HTML table of contents
Return values
string —resulting page after insert
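Inserting content just before the first h2 tag can be sketched with plain string functions. The function name `sketchInsertBeforeFirstH2` is hypothetical, and returning the page unchanged when no h2 tag exists is an assumption of this sketch, not documented Yioop behavior.

```php
<?php
/**
 * Hypothetical sketch: inserts $toc immediately before the first <h2
 * tag in $page. Assumption: the page is returned unchanged when it
 * contains no h2 tag.
 */
function sketchInsertBeforeFirstH2(string $page, string $toc): string
{
    // Case-insensitive search so <H2> is also found
    $pos = stripos($page, "<h2");
    if ($pos === false) {
        return $page;
    }
    // Splice the table of contents in at the position of the tag
    return substr($page, 0, $pos) . $toc . substr($page, $pos);
}

$page = "<p>Lead</p><h2>One</h2><p>a</p><h2>Two</h2>";
$result = sketchInsertBeforeFirstH2($page, "<div>TOC</div>");
```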
makeReferences()
Used to make a reference list for a wiki page based on the cite tags on that page.
public
makeReferences(string $page) : string
Parameters
- $page : string
-
a wiki document
Return values
string —HTML reference list to be inserted after wiki page processed
makeTableOfContents()
Used to make a table of contents for a wiki page based on the level two headings on that page.
public
makeTableOfContents(string $page) : string
Parameters
- $page : string
-
a wiki document
Return values
string —HTML table of contents to be inserted after wiki page processed
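Building a table of contents from level-two headings can be sketched as follows. This assumes level-two headings are written as `==Heading==` on their own line, as in standard mediawiki markup. The function name `sketchTableOfContents`, the ordered-list output, and the anchor convention are all assumptions of the sketch rather than Yioop's actual output format.

```php
<?php
/**
 * Hypothetical sketch: builds an HTML list of links to the level-two
 * headings (==Heading==) found in a wiki document.
 */
function sketchTableOfContents(string $page): string
{
    // Match lines of the form ==Title== but not ===Title===
    preg_match_all('/^==(?!=)\s*(.+?)\s*==\s*$/m', $page, $found);
    if (empty($found[1])) {
        return "";
    }
    $toc = "<ol>\n";
    foreach ($found[1] as $title) {
        // Assumed anchor convention: spaces become underscores
        $anchor = str_replace(" ", "_", $title);
        $toc .= '<li><a href="#' . htmlspecialchars($anchor) . '">' .
            htmlspecialchars($title) . "</a></li>\n";
    }
    return $toc . "</ol>\n";
}

$page = "Intro text\n==First Part==\nBody\n===Sub part===\n==Second==\n";
$toc = sketchTableOfContents($page);
```

Only the two level-two headings produce entries; the level-three heading is skipped by the negative lookahead.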
parse()
Parses a mediawiki document to produce an HTML equivalent
public
parse(string $document[, bool $parse_head_vars = true ][, bool $handle_big_files = false ]) : string
Parameters
- $document : string
-
a document which might have mediawiki markup
- $parse_head_vars : bool = true
-
header variables are an extension of mediawiki syntax used to add meta variables and titles to the head tag of an html document. This flag controls whether this extension is supported
- $handle_big_files : bool = false
-
for indexing purposes, Yioop by default truncates long documents before indexing them. If true, this method skips that truncation, which is more useful when using Yioop's built-in wiki.
Return values
string —HTML document obtained by parsing mediawiki markup in $document
processProvidedRegexes()
Applies a set of transformations from wiki syntax to html to a document
public
processProvidedRegexes(array<string|int, mixed> $matches, array<string|int, mixed> $replaces, string $document) : string
Parameters
- $matches : array<string|int, mixed>
-
an array of regex patterns to match
- $replaces : array<string|int, mixed>
-
what to replace matches with
- $document : string
-
wiki document to fix
Return values
string —document after substitutions
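The parallel-array design described here (one array of patterns, one of replacements) maps naturally onto PHP's `preg_replace`, which accepts arrays for both arguments. The following is a minimal sketch under that assumption. The function name `applyWikiRegexes` and the two sample rules are hypothetical and are not taken from Yioop's rule set.

```php
<?php
/**
 * Hypothetical stand-in for a processProvidedRegexes-style helper:
 * applies each pattern in $matches with the replacement found at the
 * same index in $replaces.
 */
function applyWikiRegexes(array $matches, array $replaces,
    string $document): string
{
    // preg_replace accepts parallel arrays and applies the pairs in order
    $result = preg_replace($matches, $replaces, $document);
    // preg_replace returns null on a regex error; fall back to the input
    return $result ?? $document;
}

// Illustrative rules only (assumed, not Yioop's actual rules)
$matches = [
    "/'''(.+?)'''/s", // bold is listed first so it runs before italic
    "/''(.+?)''/s",
];
$replaces = [
    '<b>$1</b>',
    '<i>$1</i>',
];
$html = applyWikiRegexes($matches, $replaces,
    "A '''bold''' and ''italic'' word.");
```

Rule order matters in this style of parser: if the italic rule ran first, it would consume part of the triple-quote bold markers.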
processRegexes()
Applies all the wiki substitutions of this WikiParser to the document to create an html document. Makes use of processProvidedRegexes()
public
processRegexes(string $document) : string
Parameters
- $document : string
-
a document with wiki syntax
Return values
string —result of substitutions to make html