Yioop_V9.5_Source_Code_Documentation

UrlParser
in package

Library of functions used to manipulate and to extract components from urls

Tags
author

Chris Pollett

Table of Contents

canonicalLink()  : string
Given a $link that was obtained from a website $site, returns a complete URL for that link.
checkRecursiveUrl()  : bool
Checks if a url has a repeated set of subdirectories, and if the number of repeats occurs more than some threshold number of times
cleanRedundantLinks()  : array<string|int, mixed>
Used to delete links from array of links $links based on whether they are the same as the site they came from (or otherwise judged irrelevant)
countCompanyLevelDomainsInCommonDetectFarm()  : int
Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations. It also tries to determine if a $url is potentially part of a link farm. To do this it checks (1) if the number of distinct, not sub-locale domains with a shared company domain is high (> $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control. For example, a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes.
cullByDomainFilter()  : bool
Checks if a url's host is either a company level domain (a cld) or is of the form www.cld or has as its cld a domain that is in one of the supplied BloomFilterFile objects
extractTextFromUrl()  : string
Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl() except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents.
getBaseDomain()  : string
Gets the domain of a url less any leading www
getCompanyLevelDomain()  : string
Calculates the company level domain for the given url
getDocumentFilename()  : string
Gets the filename portion of a url if present; otherwise returns "Some File"
getDocumentType()  : string
Given a url, makes a guess at the file type of the file it points to
getFragment()  : string
Get the url fragment string component of a url
getHost()  : string|false
Get the host name portion of a url if present; if not return false
getHostAndPath()  : array<string|int, mixed>
Returns as a two element array the host and path of a url
getHostPaths()  : array<string|int, mixed>
Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url
getHostSubdomains()  : array<string|int, mixed>
Gets the subdomains of the host portion of a url.
getLang()  : string|false
Attempts to guess the language tag based on url
getPath()  : string|null
Get the path portion of a url if present; if not return null
getPort()  : int
Get the port number of a url if present; if not return 80
getQuery()  : string
Get the query string component of a url
getScheme()  : string
Get the scheme of a url if present; if not return http
getWordsInHostUrl()  : string
Given a url, extracts the words in the host part of the url provided the url does not have a path part more than / .
getWordsLastPathPartUrl()  : string
Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "
guessFileSizeFromUrl()  : int
Used to guess the file size in bytes of the file that a url is pointed at based on its file type.
guessMimeTypeFromFileName()  : string
Guess mime type based on extension of the file
hasHostUrl()  : bool
Checks if the url has a host part.
isLocalhostUrl()  : bool
Checks if a $url is on localhost
isPathMemberRegexPaths()  : bool
Checks if $path matches against any of the Robots.txt style regex paths in $paths
isSchemeCrawlable()  : bool
Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).
pruneLinks()  : array<string|int, mixed>
Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.
simplifyUrl()  : string
Converts a url with a scheme into one without. Also removes trailing slashes from url. Shortens url to desired length by inserting ellipsis for part of it if necessary
urlMemberSiteArray()  : mixed
Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.

Methods

canonicalLink()

Given a $link that was obtained from a website $site, returns a complete URL for that link.

public static canonicalLink(string $link, string $site[, bool $no_fragment = true ]) : string

For example, the $link some_dir/test.html on the $site http://www.somewhere.com/bob would yield the complete url http://www.somewhere.com/bob/some_dir/test.html

Parameters
$link : string

a relative or complete url

$site : string

a base url

$no_fragment : bool = true

if false, a fragment (#link_within_page) present in the url is included in the result; if true, it is left out

Return values
string

a complete url based on these two pieces of information
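
The resolution described above can be illustrated with a minimal sketch. It is not Yioop's implementation: the function name sketchCanonicalLink is hypothetical, the site path is treated as a directory (to match the example above), and "../" segments and ports are deliberately not handled.

```php
<?php
// Hypothetical stand-in for illustration only; not Yioop's canonicalLink().
function sketchCanonicalLink(string $link, string $site,
    bool $no_fragment = true): string
{
    if ($no_fragment) {
        // drop everything from the first # onward
        $link = explode("#", $link, 2)[0];
    }
    if (preg_match('@^[a-z][a-z0-9+.-]*://@i', $link)) {
        return $link; // already a complete url
    }
    if (substr($link, 0, 1) === "/") {
        // host-relative link: keep only scheme and host of the site
        $parts = parse_url($site);
        $scheme = $parts["scheme"] ?? "http";
        $host = $parts["host"] ?? "";
        return $scheme . "://" . $host . $link;
    }
    // relative link: append to the site, treating its path as a directory
    return rtrim($site, "/") . "/" . $link;
}

echo sketchCanonicalLink("some_dir/test.html", "http://www.somewhere.com/bob");
// prints http://www.somewhere.com/bob/some_dir/test.html
```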

checkRecursiveUrl()

Checks if a url has a repeated set of subdirectories, and if the number of repeats occurs more than some threshold number of times

public static checkRecursiveUrl(string $url[, int $repeat_threshold = 3 ]) : bool

A pattern like bob/.../bob counts as one repetition. bob/.../alice/.../bob/.../alice would count as two (... should be read as an ellipsis, not a directory name). If the threshold is three and there are at least three repeated matches, this function returns true; it returns false otherwise.

Parameters
$url : string

the url to check

$repeat_threshold : int = 3

the number of repeats of a subdir name to trigger a true response

Return values
bool

whether a repeated subdirectory name with more matches than the threshold was found
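
One way to read the counting rule above is sketched below. This is an assumption about the rule, not the library's code: the hypothetical sketchCheckRecursiveUrl counts every reoccurrence of a directory name after its first appearance and compares the total with the threshold.

```php
<?php
// Hypothetical sketch of the repetition count; not Yioop's checkRecursiveUrl().
function sketchCheckRecursiveUrl(string $url, int $repeat_threshold = 3): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    // directory names, ignoring empty pieces produced by leading or double slashes
    $segments = array_filter(explode("/", $path), function ($s) {
        return $s !== "";
    });
    $repeats = 0;
    foreach (array_count_values($segments) as $count) {
        $repeats += $count - 1; // each appearance after the first is one repeat
    }
    return $repeats >= $repeat_threshold;
}

var_dump(sketchCheckRecursiveUrl("http://example.com/bob/x/bob")); // bool(false)
var_dump(sketchCheckRecursiveUrl("http://example.com/a/b/a/b/a/b")); // bool(true)
```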

cleanRedundantLinks()

Used to delete links from array of links $links based on whether they are the same as the site they came from (or otherwise judged irrelevant)

public static cleanRedundantLinks(array<string|int, mixed> $links, string $parent_url) : array<string|int, mixed>
Parameters
$links : array<string|int, mixed>

pairs of the form $link => $link_info

$parent_url : string

a site that the links were found on

Return values
array<string|int, mixed>

just those links which pass the relevancy test

countCompanyLevelDomainsInCommonDetectFarm()

Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations. It also tries to determine if a $url is potentially part of a link farm. To do this it checks (1) if the number of distinct, not sub-locale domains with a shared company domain is high (> $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control. For example, a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes.

public static countCompanyLevelDomainsInCommonDetectFarm(string $url, array<string|int, mixed> $links[, int $threshold = 200 ]) : int

If situation (1) is detected, this method returns -1. This method also returns -1 if (2) there seem to be lots of links ($threshold) from the current domain to a single domain that shares the same company domain. This might indicate a domain md5_hash.foo.com with lots of links to a domain www.foo.com

Parameters
$url : string

the url to compare against $links

$links : array<string|int, mixed>

an array of urls

$threshold : int = 200

number above which, if either situation (1) or (2) above occurs, the site is deemed spam

Return values
int

the number of times $url shares the cld with a link in $links. Returns -1 if $url is thought to be part of a link farm

cullByDomainFilter()

Checks if a url's host is either a company level domain (a cld) or is of the form www.cld or has as its cld a domain that is in one of the supplied BloomFilterFile objects

public static cullByDomainFilter(string $url, array<string|int, mixed> $filters) : bool
Parameters
$url : string

url to check if in above form

$filters : array<string|int, mixed>

array of BloomFilterFile objects

Return values
bool

whether or not url has above form

extractTextFromUrl()

Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl() except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents.

public static extractTextFromUrl(string $url) : string
Parameters
$url : string

the url in which to look for text that might say what the link is about

Return values
string

heuristically derived text.

getBaseDomain()

Gets the domain of a url less any leading www

public static getBaseDomain(string $url) : string
Parameters
$url : string

to get domain of

Return values
string

the base domain as defined above

getCompanyLevelDomain()

Calculates the company level domain for the given url

public static getCompanyLevelDomain(string $url) : string

For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations.

Parameters
$url : string

url to determine cld for

Return values
string

the cld of $url
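
The two examples above can be reproduced with a small heuristic sketch. The rule used here (keep three labels when a two-letter country tld follows a generic second-level label such as co or org) is an assumption chosen for illustration; the function name sketchCompanyLevelDomain and the label list are hypothetical and may differ from what Yioop does.

```php
<?php
// Hypothetical heuristic; not Yioop's getCompanyLevelDomain().
function sketchCompanyLevelDomain(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return "";
    }
    $labels = explode(".", strtolower($host));
    $n = count($labels);
    if ($n <= 2) {
        return implode(".", $labels);
    }
    // assumed list of generic second-level labels used under country tlds
    $second_level = ["co", "com", "org", "net", "ac", "gov", "edu"];
    $is_country_style = strlen($labels[$n - 1]) == 2 &&
        in_array($labels[$n - 2], $second_level, true);
    $take = $is_country_style ? 3 : 2;
    return implode(".", array_slice($labels, -$take));
}

echo sketchCompanyLevelDomain("http://www.yahoo.com/"), "\n";
// prints yahoo.com
echo sketchCompanyLevelDomain("http://www.theregister.co.uk/"), "\n";
// prints theregister.co.uk
```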

getDocumentFilename()

Gets the filename portion of a url if present; otherwise returns "Some File"

public static getDocumentFilename(string $url) : string
Parameters
$url : string

a url to parse

Return values
string

the filename portion of this url

getDocumentType()

Given a url, makes a guess at the file type of the file it points to

public static getDocumentType(string $url[, string $default = "html" ]) : string
Parameters
$url : string

a url to figure out the file type for

$default : string = "html"

default type to be returned in the case that document type cannot be determined from the url, defaults to html

Return values
string

the guessed file type.

getFragment()

Get the url fragment string component of a url

public static getFragment(string $url) : string
Parameters
$url : string

a url to get the url fragment string out of

Return values
string

the url fragment string if present; null otherwise

getHost()

Get the host name portion of a url if present; if not return false

public static getHost(string $url[, bool $with_login_and_port = true ]) : string|false
Parameters
$url : string

the url to parse

$with_login_and_port : bool = true

whether to include user, password, and port if present

Return values
string|false

the host portion of the url if present; false otherwise

getHostAndPath()

Returns as a two element array the host and path of a url

public static getHostAndPath(string $url[, bool $with_login_and_port = true ][, bool $with_query_string = false ]) : array<string|int, mixed>
Parameters
$url : string

initial url to get host and path of

$with_login_and_port : bool = true

controls whether the host should contain login and port info

$with_query_string : bool = false

says whether the path should contain the query string as well

Return values
array<string|int, mixed>

host and the path as a pair

getHostPaths()

Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url

public static getHostPaths(string $url) : array<string|int, mixed>

For example, http://host.com/b/c/ would yield http://host.com/, http://host.com/b, http://host.com/b/, http://host.com/b/c, http://host.com/b/c/

Parameters
$url : string

the url to extract prefixes from

Return values
array<string|int, mixed>

the array of url prefixes
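
The prefix list for http://host.com/b/c/ given above can be generated with a short sketch. The function name sketchHostPaths is hypothetical, and login, port, and query handling are left out; this is only one way to obtain the listed prefixes.

```php
<?php
// Hypothetical sketch; not Yioop's getHostPaths().
function sketchHostPaths(string $url): array
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts["host"])) {
        return [];
    }
    $prefix = ($parts["scheme"] ?? "http") . "://" . $parts["host"];
    $out = [$prefix . "/"];
    $path = $parts["path"] ?? "/";
    $ends_with_slash = substr($path, -1) === "/";
    $segments = array_values(array_filter(explode("/", $path),
        function ($s) {
            return $s !== "";
        }));
    $last = count($segments) - 1;
    foreach ($segments as $i => $segment) {
        $prefix .= "/" . $segment;
        $out[] = $prefix;
        // add the slash-terminated form unless the url stops before the slash
        if ($i < $last || $ends_with_slash) {
            $out[] = $prefix . "/";
        }
    }
    return $out;
}

print_r(sketchHostPaths("http://host.com/b/c/"));
```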

getHostSubdomains()

Gets the subdomains of the host portion of a url.

public static getHostSubdomains(string $url) : array<string|int, mixed>

For example, http://a.b.c/d/f/ will return a.b.c, .a.b.c, b.c, .b.c, c, .c

Parameters
$url : string

the url to extract subdomains from

Return values
array<string|int, mixed>

the array of subdomains
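
The output for http://a.b.c/d/f/ shown above can be reproduced by walking the host labels from left to right. The sketch below is an illustration under that reading; sketchHostSubdomains is a hypothetical name, not the library method.

```php
<?php
// Hypothetical sketch; not Yioop's getHostSubdomains().
function sketchHostSubdomains(string $url): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === "") {
        return [];
    }
    $labels = explode(".", $host);
    $out = [];
    for ($i = 0; $i < count($labels); $i++) {
        // suffix of the host starting at label $i, with and without a leading dot
        $suffix = implode(".", array_slice($labels, $i));
        $out[] = $suffix;
        $out[] = "." . $suffix;
    }
    return $out;
}

echo implode(", ", sketchHostSubdomains("http://a.b.c/d/f/"));
// prints a.b.c, .a.b.c, b.c, .b.c, c, .c
```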

getLang()

Attempts to guess the language tag based on url

public static getLang(string $url) : string|false
Parameters
$url : string

the url to parse

Return values
string|false

the top level domain if present; false otherwise

getPath()

Get the path portion of a url if present; if not return null

public static getPath(string $url[, bool $with_query_string = false ]) : string|null
Parameters
$url : string

the url to parse

$with_query_string : bool = false

whether to also include the query string at the end of the path

Return values
string|null

the path portion of the url if present; null otherwise

getPort()

Get the port number of a url if present; if not return 80

public static getPort(string $url) : int
Parameters
$url : string

the url to extract port number from

Return values
int

a port number

getQuery()

Get the query string component of a url

public static getQuery(string $url) : string
Parameters
$url : string

a url to get the query string out of

Return values
string

the query string if present; null otherwise

getScheme()

Get the scheme of a url if present; if not return http

public static getScheme(string $url) : string
Parameters
$url : string

the url to extract scheme from

Return values
string

the scheme of the url if present; http otherwise
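
The fallback values described for getPort() (80) and getScheme() (http) can be illustrated with PHP's built-in parse_url(). The helper names below are hypothetical, and whether Yioop itself relies on parse_url() is not stated in this documentation.

```php
<?php
// Hypothetical helpers built on parse_url(); not Yioop's own code.
function sketchScheme(string $url): string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // parse_url gives null when the component is absent, false when unparsable
    return is_string($scheme) ? $scheme : "http";
}

function sketchPort(string $url): int
{
    $port = parse_url($url, PHP_URL_PORT);
    return is_int($port) ? $port : 80;
}

echo sketchScheme("gopher://example.com/"), "\n"; // prints gopher
echo sketchScheme("www.example.com/a"), "\n"; // prints http
echo sketchPort("https://example.com:8443/x"), "\n"; // prints 8443
echo sketchPort("http://example.com/"), "\n"; // prints 80
```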

getWordsInHostUrl()

Given a url, extracts the words in the host part of the url provided the url does not have a path part more than / .

public static getWordsInHostUrl(string $url) : string

Ignores a leading www and also ignores the tld.

For example, "http://www.yahoo.com/" returns " yahoo "

Parameters
$url : string

a url to extract words from

Return values
string

space separated words extracted.
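
The example above ("http://www.yahoo.com/" gives " yahoo ") can be sketched as follows. The name sketchWordsInHost is hypothetical, and returning an empty string when the url has a longer path is an assumption, since the documentation does not say what is returned in that case.

```php
<?php
// Hypothetical sketch; not Yioop's getWordsInHostUrl().
function sketchWordsInHost(string $url): string
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts["host"])) {
        return "";
    }
    $path = $parts["path"] ?? "/";
    if ($path !== "/" && $path !== "") {
        return ""; // assumed result for urls with a real path
    }
    $labels = explode(".", $parts["host"]);
    if ($labels[0] === "www") {
        array_shift($labels); // ignore a leading www
    }
    array_pop($labels); // ignore the tld
    return $labels ? " " . implode(" ", $labels) . " " : "";
}

var_dump(sketchWordsInHost("http://www.yahoo.com/")); // string(7) " yahoo "
```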

getWordsLastPathPartUrl()

Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "

public static getWordsLastPathPartUrl(string $url) : string
Parameters
$url : string

a url to extract words from

Return values
string

space separated words extracted.
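
The php.net example above can be reproduced with a sketch that drops the file extension and splits the remaining name on non-alphanumeric characters. The function name and the exact splitting rule are assumptions made for illustration.

```php
<?php
// Hypothetical sketch; not Yioop's getWordsLastPathPartUrl().
function sketchWordsLastPathPart(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return "";
    }
    // last path part without its final extension
    $name = pathinfo($path, PATHINFO_FILENAME);
    $words = preg_split('/[^A-Za-z0-9]+/', $name, -1, PREG_SPLIT_NO_EMPTY);
    return $words ? " " . implode(" ", $words) . " " : "";
}

var_dump(sketchWordsLastPathPart(
    "http://us3.php.net/manual/en/function.array-filter.php"));
// string(23) " function array filter "
```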

guessFileSizeFromUrl()

Used to guess the file size in bytes of the file that a url is pointed at based on its file type.

public static guessFileSizeFromUrl(string $url) : int
Parameters
$url : string

url of the file to estimate the size of

Return values
int

estimated number of bytes

guessMimeTypeFromFileName()

Guess mime type based on extension of the file

public static guessMimeTypeFromFileName(string $file_name[, string $default = 'text/plain' ]) : string
Parameters
$file_name : string

name of the file

$default : string = 'text/plain'

what mime type to return if mime type couldn't be determined

Return values
string

$mime_type for the given file name

hasHostUrl()

Checks if the url has a host part.

public static hasHostUrl(string $url) : bool
Parameters
$url : string

the url to check

Return values
bool

true if it does; false otherwise

isLocalhostUrl()

Checks if a $url is on localhost

public static isLocalhostUrl(string $url) : bool
Parameters
$url : string

the url to check

Return values
bool

whether or not it is on localhost

isPathMemberRegexPaths()

Checks if $path matches against any of the Robots.txt style regex paths in $paths

public static isPathMemberRegexPaths(string $path, array<string|int, mixed> $robot_paths) : bool
Parameters
$path : string

a path component of a url

$robot_paths : array<string|int, mixed>

in format of robots.txt regex paths

Return values
bool

whether it is a member or not

isSchemeCrawlable()

Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).

public static isSchemeCrawlable(string $url) : bool
Parameters
$url : string

the url to check

Return values
bool

returns true if it is either http, https, or gopher and false otherwise

pruneLinks()

Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.

public static pruneLinks(array<string|int, mixed> $links[, int $max_links = C\MAX_LINKS_TO_EXTRACT ]) : array<string|int, mixed>

To compute this, we count the number of terms by splitting on white space. We then multiply this by the ratio of the compressed length of the text divided by its uncompressed length.

Parameters
$links : array<string|int, mixed>

list of pairs $url=>$text

$max_links : int = C\MAX_LINKS_TO_EXTRACT

maximum number of links from $links to return

Return values
array<string|int, mixed>

$out_links extracted from $links according to the description above.
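
The information measure described above (term count multiplied by compressed length over uncompressed length) can be sketched as below. This is an illustration, not the library's code: the function names are hypothetical, and the use of gzcompress() (which needs PHP's zlib extension) as the compressor is an assumption.

```php
<?php
// Hypothetical sketch of the scoring idea; not Yioop's pruneLinks().
function sketchLinkInfo(string $text): float
{
    $text = trim($text);
    if ($text === "") {
        return 0.0;
    }
    $num_terms = count(preg_split('/\s+/', $text));
    // repetitive text compresses well, so its effective term count shrinks
    return $num_terms * strlen(gzcompress($text)) / strlen($text);
}

function sketchPruneLinks(array $links, int $max_links): array
{
    if (count($links) <= $max_links) {
        return $links;
    }
    $scores = array_map('sketchLinkInfo', $links); // keys (urls) are preserved
    arsort($scores); // highest score first
    $keep = array_slice(array_keys($scores), 0, $max_links);
    return array_intersect_key($links, array_flip($keep));
}

$links = [
    "http://a.example/" => "aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa",
    "http://b.example/" => "the quick brown fox jumps over lazy dogs",
];
print_r(array_keys(sketchPruneLinks($links, 1)));
```

Both texts have eight terms, but the repetitive one compresses far better, so the link with the varied text is the one that survives pruning.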

simplifyUrl()

Converts a url with a scheme into one without. Also removes trailing slashes from url. Shortens url to desired length by inserting ellipsis for part of it if necessary

public static simplifyUrl(string $url, int $max_len) : string
Parameters
$url : string

the url to trim

$max_len : int

length to shorten url to, 0 = no shortening

Return values
string

the trimmed url
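
A minimal sketch of the three steps described above (drop the scheme, drop trailing slashes, shorten with an ellipsis) follows. The name sketchSimplifyUrl is hypothetical, and placing the ellipsis in the middle of the url is an assumption; the documentation does not say which part is replaced.

```php
<?php
// Hypothetical sketch; not Yioop's simplifyUrl().
function sketchSimplifyUrl(string $url, int $max_len): string
{
    // remove a leading scheme such as http:// and any trailing slashes
    $out = rtrim(preg_replace('@^[a-z][a-z0-9+.-]*://@i', '', $url), "/");
    if ($max_len > 3 && strlen($out) > $max_len) {
        $keep = $max_len - 3; // characters left once "..." is inserted
        $head = intdiv($keep + 1, 2);
        $tail = $keep - $head;
        $out = substr($out, 0, $head) . "..." .
            ($tail > 0 ? substr($out, -$tail) : "");
    }
    return $out;
}

echo sketchSimplifyUrl("http://www.example.com/", 0), "\n";
// prints www.example.com
echo sketchSimplifyUrl("https://www.example.com/a/b/c/", 10), "\n";
// prints www....b/c
```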

urlMemberSiteArray()

Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.

public static urlMemberSiteArray(string $url, array<string|int, mixed> $site_array, string $name[, bool $return_rule = false ]) : mixed
Parameters
$url : string

url to check

$site_array : array<string|int, mixed>

sites to check against

$name : string

identifier to store $site_array with in this function's cache

$return_rule : bool = false

whether, when a match is found, to return the matching site rule instead of true

Return values
mixed

whether the url belongs to one of the sites
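
The two rule forms described above can be sketched as follows. This is an illustration only: sketchUrlMemberSiteArray is a hypothetical name, the caching tied to $name is omitted, and matching a domain: rule against the host or any of its subdomains is an assumption about how such rules are interpreted.

```php
<?php
// Hypothetical sketch; not Yioop's urlMemberSiteArray() (no caching, no $name).
function sketchUrlMemberSiteArray(string $url, array $site_array,
    bool $return_rule = false)
{
    $host = parse_url($url, PHP_URL_HOST);
    foreach ($site_array as $site) {
        if (strncmp($site, "domain:", 7) === 0) {
            // domain:host rule, assumed to match the host and its subdomains
            $domain = substr($site, 7);
            $suffix = "." . $domain;
            $match = is_string($host) && ($host === $domain ||
                substr($host, -strlen($suffix)) === $suffix);
        } else {
            // url rule: the site url must be a substring of the passed url
            $match = strpos($url, $site) !== false;
        }
        if ($match) {
            return $return_rule ? $site : true;
        }
    }
    return false;
}

$sites = ["domain:example.com", "http://www.foo.org/docs/"];
var_dump(sketchUrlMemberSiteArray("http://news.example.com/a", $sites));
// bool(true)
var_dump(sketchUrlMemberSiteArray("http://www.foo.org/docs/x.html", $sites, true));
// string(24) "http://www.foo.org/docs/"
```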


        
