Yioop_V9.5_Source_Code

UrlParser
in package

Application

Library of functions used to manipulate and to extract components from urls

canonicalLink()

Given a $link that was obtained from a website $site, returns a complete URL for that link.


    public
            static        canonicalLink(string $link, string $site[, string $no_fragment = true ]) : string

For example, the $link some_dir/test.html on the $site http://www.somewhere.com/bob would yield the complete url http://www.somewhere.com/bob/some_dir/test.html

Parameters

$link : string: a relative or complete url
$site : string: a base url
$no_fragment : string = true: if false, a fragment (#link_within_page) present in the url is kept in the returned url; if true (the default), it is left out

Return values

string —

a complete url based on these two pieces of information
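The resolution described above can be illustrated with a short Python sketch. It is not Yioop's PHP implementation; the name canonical_link is hypothetical, and treating a base url whose last path segment has no dot as a directory is an assumption made so that the documented example comes out as stated.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def canonical_link(link, site, no_fragment=True):
    """Resolve link against site (hypothetical helper, not Yioop code)."""
    parts = urlsplit(site)
    last_segment = parts.path.rsplit("/", 1)[-1]
    # Assumption: a final path segment without a dot names a directory,
    # so a slash is appended before the relative link is joined.
    if last_segment and "." not in last_segment:
        site = urlunsplit(parts._replace(path=parts.path + "/"))
    full_url = urljoin(site, link)
    if no_fragment:
        # Drop any #fragment from the resolved url.
        full_url = urldefrag(full_url).url
    return full_url


print(canonical_link("some_dir/test.html", "http://www.somewhere.com/bob"))
```

Passing no_fragment=False in this sketch keeps a trailing #fragment, mirroring the parameter description.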

checkRecursiveUrl()

Checks whether a url contains a repeated set of subdirectories and whether the number of repeats reaches some threshold


    public
            static        checkRecursiveUrl(string $url[, int $repeat_threshold = 3 ]) : bool

A pattern like bob/.../bob counts as one repetition. bob/.../alice/.../bob/.../alice would count as two (... should be read as an ellipsis, not a directory name). If the threshold is three and there are at least three repeated matches, this function returns true; it returns false otherwise.

Parameters

$url : string: the url to check
$repeat_threshold : int = 3: the number of repeats of a subdir name to trigger a true response

Return values

bool —

whether a repeated subdirectory name with more matches than the threshold was found
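A minimal Python sketch of the counting rule, under stated assumptions: every non-empty path segment is treated as a subdirectory name, each extra occurrence of a name counts as one repetition, and the check fires when the total reaches the threshold. The name check_recursive_url is hypothetical and the real PHP method may count differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def check_recursive_url(url, repeat_threshold=3):
    """Return True when repeated path segment names reach the threshold."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    # A name seen n times contributes n - 1 repetitions (bob/.../bob is one).
    repeats = sum(n - 1 for n in counts.values())
    return repeats >= repeat_threshold


print(check_recursive_url("http://example.com/a/b/a/b/a/b"))
```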

cleanRedundantLinks()

Deletes links from the array $links when they are the same as the site they came from (or are otherwise judged irrelevant)


    public
            static        cleanRedundantLinks(array<string|int, mixed> $links, string $parent_url) : array<string|int, mixed>

Parameters

$links : array<string|int, mixed>: pairs of the form $link => $link_info
$parent_url : string: a site that the links were found on

Return values

array<string|int, mixed> —

just those links which pass the relevancy test

countCompanyLevelDomainsInCommonDetectFarm()

Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com; for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations. The method also tries to determine whether $url is potentially part of a link farm. To do this it checks (1) whether the number of distinct, not sub-locale, domains with a shared company domain is high (greater than $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control, for example a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes.


    public
            static        countCompanyLevelDomainsInCommonDetectFarm(string $url, array<string|int, mixed> $links[, int $threshold = 200 ]) : int

If situation (1) is detected, this method returns -1. It also returns -1 if (2) there seem to be lots of links ($threshold) from the current domain to a single domain that shares the same company domain. This might indicate a domain md5_hash.foo.com with lots of links to a domain www.foo.com.

Parameters

$url : string: the url to compare against $links
$links : array<string|int, mixed>: an array of urls
$threshold : int = 200: number above which, if either situation (1) or (2) above occurs, the site is deemed spam

Return values

int —

the number of times $url shares the cld with a link in $links. Returns -1 if $url appears to be part of a link farm
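The two link farm checks can be sketched in Python as follows. This is an illustration only: simple_cld is a deliberately naive stand-in that keeps the last two host labels, the sub-locale refinement mentioned above is ignored, and the function names are hypothetical rather than Yioop's.

```python
from collections import Counter
from urllib.parse import urlsplit


def simple_cld(url):
    """Naive stand-in for a cld: the last two labels of the host."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def count_cld_in_common_detect_farm(url, links, threshold=200):
    cld = simple_cld(url)
    own_host = urlsplit(url).hostname or ""
    hosts = Counter()
    for link in links:
        if simple_cld(link) == cld:
            hosts[urlsplit(link).hostname] += 1
    other_hosts = {h: n for h, n in hosts.items() if h != own_host}
    # Situation (1): many distinct hosts under the same company domain.
    if len(other_hosts) > threshold / 2:
        return -1
    # Situation (2): many links to one other host of the same company.
    if any(n > threshold for n in other_hosts.values()):
        return -1
    return sum(hosts.values())


links = ["http://www.foo.com/a", "http://bar.com/", "http://m.foo.com/"]
print(count_cld_in_common_detect_farm("http://www.foo.com/", links))
```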

cullByDomainFilter()

Checks if a url's host is either a company level domain (a cld) or is of the form www.cld or has as its cld a domain that is in one of the supplied BloomFilterFile objects


    public
            static        cullByDomainFilter(string $url, array<string|int, mixed> $filters) : bool

Parameters

$url : string: url to check if in above form
$filters : array<string|int, mixed>: array of BloomFilterFile objects

Return values

bool —

whether or not url has above form

extractTextFromUrl()

Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl(), except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents


    public
            static        extractTextFromUrl(string $url) : string

Parameters

$url : string: url in which to find text that might say what the link is about

Return values

string —

heuristically derived text.

getBaseDomain()

Gets the domain of a url less any leading www


    public
            static        getBaseDomain(string $url) : string

Parameters

$url : string: to get domain of

Return values

string —

the base domain as defined above

getCompanyLevelDomain()

Calculates the company level domain for the given url


    public
            static        getCompanyLevelDomain(string $url) : string

For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations.

Parameters

$url : string: url to determine cld for

Return values

string —

the cld of $url
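As a rough Python illustration of the two examples above: the sketch keeps three host labels when a two-letter country tld follows a generic second-level label, and two labels otherwise. The SECOND_LEVEL set and the function name are assumptions for illustration; the rules used by the PHP method are not shown on this page.

```python
from urllib.parse import urlsplit

# Assumed list of generic second-level labels used under country tlds.
SECOND_LEVEL = {"co", "com", "org", "net", "ac", "gov", "edu"}


def company_level_domain(url):
    """Return a guess at the company level domain of url."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    # e.g. theregister.co.uk: country tld preceded by a generic label.
    if len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(company_level_domain("http://www.theregister.co.uk/news"))
```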

getDocumentFilename()

Gets the filename portion of a url if present; otherwise returns "Some File"


    public
            static        getDocumentFilename(string $url) : string

Parameters

$url : string: a url to parse

Return values

string —

the filename portion of this url

getDocumentType()

Given a url, makes a guess at the file type of the file it points to


    public
            static        getDocumentType(string $url[, string $default = "html" ]) : string

Parameters

$url : string: a url to figure out the file type for
$default : string = "html": default type to be returned in the case that document type cannot be determined from the url, defaults to html

Return values

string —

the guessed file type.

getFragment()

Get the url fragment string component of a url


    public
            static        getFragment(string $url) : string

Parameters

$url : string: a url to get the url fragment string out of

Return values

string —

the url fragment string if present; null otherwise

getHost()

Get the host name portion of a url if present; if not return false


    public
            static        getHost(string $url[, bool $with_login_and_port = true ]) : string|false

Parameters

$url : string: the url to parse
$with_login_and_port : bool = true: whether to include user,password,port if present

Return values

string|false —

the host portion of the url if present; false otherwise

getHostAndPath()

Returns as a two element array the host and path of a url


    public
            static        getHostAndPath(string $url[, bool $with_login_and_port = true ][, bool $with_query_string = false ]) : array<string|int, mixed>

Parameters

$url : string: initial url to get host and path of
$with_login_and_port : bool = true: controls whether the host should contain login and port info
$with_query_string : bool = false: says whether the path should contain the query string as well

Return values

array<string|int, mixed> —

host and the path as a pair

getHostPaths()

Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url


    public
            static        getHostPaths(string $url) : array<string|int, mixed>

For example, http://host.com/b/c/ would yield http://host.com/, http://host.com/b, http://host.com/b/, http://host.com/b/c, http://host.com/b/c/

Parameters

$url : string: the url to extract prefixes from

Return values

array<string|int, mixed> —

the array of url prefixes
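The prefix enumeration in the example can be reproduced with a few lines of Python. This is a sketch of the documented output, not the PHP source; host_paths is a hypothetical name, and query strings and fragments are simply ignored here.

```python
from urllib.parse import urlsplit


def host_paths(url):
    """List url prefixes from the host root down to the full path."""
    parts = urlsplit(url)
    current = f"{parts.scheme}://{parts.netloc}"
    prefixes = [current + "/"]
    names = [s for s in parts.path.split("/") if s]
    for i, name in enumerate(names):
        current += "/" + name
        prefixes.append(current)
        # Add the slash form unless this is a final segment without one.
        if i < len(names) - 1 or parts.path.endswith("/"):
            prefixes.append(current + "/")
    return prefixes


print(host_paths("http://host.com/b/c/"))
```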

getHostSubdomains()

Gets the subdomains of the host portion of a url.


    public
            static        getHostSubdomains(string $url) : array<string|int, mixed>

For example, http://a.b.c/d/f/ will return a.b.c, .a.b.c, b.c, .b.c, c, .c

Parameters

$url : string: the url to extract subdomains from

Return values

array<string|int, mixed> —

the array of subdomains
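The a.b.c example can be sketched in Python as below. The function name is hypothetical; the output order simply follows the example, with each host suffix followed by its dotted form.

```python
from urllib.parse import urlsplit


def host_subdomains(url):
    """Return each suffix of the host, plain and with a leading dot."""
    host = urlsplit(url).hostname or ""
    if not host:
        return []
    labels = host.split(".")
    subdomains = []
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        subdomains.extend([suffix, "." + suffix])
    return subdomains


print(host_subdomains("http://a.b.c/d/f/"))
```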

getLang()

Attempts to guess the language tag based on url


    public
            static        getLang(string $url) : string|false

Parameters

$url : string: the url to parse

Return values

string|false —

the language tag guessed from the url if one could be determined; false otherwise

getPath()

Get the path portion of a url if present; if not return null


    public
            static        getPath(string $url[, bool $with_query_string = false ]) : string|null

Parameters

$url : string: the url to parse
$with_query_string : bool = false: whether to also include the query string at the end of the path

Return values

string|null —

the path portion of the url if present; null otherwise

getPort()

Get the port number of a url if present; if not return 80


    public
            static        getPort(string $url) : int

Parameters

$url : string: the url to extract port number from

Return values

int —

a port number

getQuery()

Get the query string component of a url


    public
            static        getQuery(string $url) : string

Parameters

$url : string: a url to get the query string out of

Return values

string —

the query string if present; null otherwise

getScheme()

Get the scheme of a url if present; if not return http


    public
            static        getScheme(string $url) : string

Parameters

$url : string: the url to extract scheme from

Return values

string —

the scheme of the url if present; http otherwise

getWordsInHostUrl()

Given a url, extracts the words in the host part of the url, provided the url has no path part beyond /.


    public
            static        getWordsInHostUrl(string $url) : string

Ignores a leading www and also ignores the tld.

For example, "http://www.yahoo.com/" returns " yahoo "

Parameters

$url : string: a url to extract words from

Return values

string —

space separated words extracted.
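A small Python sketch of the rule described above. Returning an empty string when the url has a longer path is an assumption, as is the hypothetical name words_in_host_url; only the www.yahoo.com example is taken from the documentation.

```python
from urllib.parse import urlsplit


def words_in_host_url(url):
    """Return space padded host words, skipping a leading www and the tld."""
    parts = urlsplit(url)
    # Assumption: urls with a path beyond "/" yield no host words.
    if parts.path not in ("", "/"):
        return ""
    labels = (parts.hostname or "").split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    words = [label for label in labels[:-1] if label]
    return " " + " ".join(words) + " " if words else ""


print(repr(words_in_host_url("http://www.yahoo.com/")))
```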

getWordsLastPathPartUrl()

Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "


    public
            static        getWordsLastPathPartUrl(string $url) : string

Parameters

$url : string: a url to extract words from

Return values

string —

space separated words extracted.
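The php.net example can be reproduced with the Python sketch below. It assumes the file extension is dropped and the remainder is split on any non-alphanumeric character; the PHP method may use different splitting rules, and the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit


def words_last_path_part_url(url):
    """Return space padded words from the last path segment of url."""
    path = urlsplit(url).path.rstrip("/")
    last = path.rsplit("/", 1)[-1]
    # Assumption: text after the final dot is an extension and is dropped.
    stem = last.rsplit(".", 1)[0] if "." in last else last
    words = [w for w in re.split(r"[^A-Za-z0-9]+", stem) if w]
    return " " + " ".join(words) + " " if words else ""


print(repr(words_last_path_part_url(
    "http://us3.php.net/manual/en/function.array-filter.php")))
```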

guessFileSizeFromUrl()

Used to guess the file size in bytes of the file that a url is pointed at based on its file type.


    public
            static        guessFileSizeFromUrl(string $url) : int

Parameters

$url : string: to estimate the size of

Return values

int —

estimated number of bytes

guessMimeTypeFromFileName()

Guess mime type based on extension of the file


    public
            static        guessMimeTypeFromFileName(string $file_name[, string $default = 'text/plain' ]) : string

Parameters

$file_name : string: name of the file
$default : string = 'text/plain': what mime type to return if mime type couldn't be determined

Return values

string —

$mime_type for the given file name
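An extension lookup with a fallback might look like the following Python sketch. The table is a tiny assumed sample and the function name is hypothetical; the mapping used by the PHP method is not listed on this page.

```python
# Assumed sample of extension to mime type pairs, for illustration only.
MIME_BY_EXTENSION = {
    "html": "text/html",
    "txt": "text/plain",
    "pdf": "application/pdf",
    "png": "image/png",
    "jpg": "image/jpeg",
}


def guess_mime_type_from_file_name(file_name, default="text/plain"):
    """Look up the mime type by extension, falling back to default."""
    if "." not in file_name:
        return default
    extension = file_name.rsplit(".", 1)[-1].lower()
    return MIME_BY_EXTENSION.get(extension, default)


print(guess_mime_type_from_file_name("report.PDF"))
```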

hasHostUrl()

Checks if the url has a host part.


    public
            static        hasHostUrl(string $url) : bool

Parameters

$url : string: the url to check

Return values

bool —

true if it does; false otherwise

isLocalhostUrl()

Checks if a $url is on localhost


    public
            static        isLocalhostUrl(string $url) : bool

Parameters

$url : string: the url to check

Return values

bool —

whether or not it is on localhost

isPathMemberRegexPaths()

Checks if $path matches against any of the robots.txt style regex paths in $robot_paths


    public
            static        isPathMemberRegexPaths(string $path, array<string|int, mixed> $robot_paths) : bool

Parameters

$path : string: a path component of a url
$robot_paths : array<string|int, mixed>: in format of robots.txt regex paths

Return values

bool —

whether it is a member or not
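A Python sketch of robots.txt style matching, assuming the common conventions that * matches any run of characters, a trailing $ anchors the end of the path, and every pattern is matched from the start of the path. The function name is hypothetical and the PHP method may differ in detail.

```python
import re


def is_path_member_regex_paths(path, robot_paths):
    """Return True if path matches any robots.txt style pattern."""
    for pattern in robot_paths:
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        # Escape literal pieces and let each * match any characters.
        regex = ".*".join(re.escape(piece) for piece in body.split("*"))
        if anchored:
            regex += "$"
        if re.match(regex, path):
            return True
    return False


print(is_path_member_regex_paths("/images/cat.gif", ["/*.gif$"]))
```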

isSchemeCrawlable()

Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).


    public
            static        isSchemeCrawlable(string $url) : bool

Parameters

$url : string: the url to check

Return values

bool —

returns true if it is either http, https, or gopher and false otherwise

pruneLinks()

Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.


    public
            static        pruneLinks(array<string|int, mixed> $links[, int $max_links = CMAX_LINKS_TO_EXTRACT ]) : array<string|int, mixed>

To compute this, we count the number of terms by splitting on white space. We then multiply this by the ratio of the compressed length of the text divided by its uncompressed length.

Parameters

$links : array<string|int, mixed>: list of pairs $url=>$text
$max_links : int = CMAX_LINKS_TO_EXTRACT: maximum number of links from $links to return

Return values

array<string|int, mixed> —

$out_links extracted from $links according to the description above.
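The information measure described above can be sketched in Python. The sketch uses zlib as the compressor and returns all links unchanged when there are no more than max_links of them; both choices, and the function names, are assumptions rather than details confirmed by this page.

```python
import zlib


def effective_terms(text):
    """Whitespace term count scaled by compressed/uncompressed length."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    num_terms = len(text.split())
    return num_terms * len(zlib.compress(raw)) / len(raw)


def prune_links(links, max_links):
    """Keep the max_links url => text pairs with the highest score."""
    if len(links) <= max_links:
        return dict(links)
    ranked = sorted(links.items(),
                    key=lambda pair: effective_terms(pair[1]),
                    reverse=True)
    return dict(ranked[:max_links])


links = {
    "u1": "click here",
    "u2": " ".join(["a"] * 40),
    "u3": "detailed review of open source search engine crawlers "
          "and indexing strategies",
}
print(list(prune_links(links, 1)))
```

Highly repetitive text compresses well, so its ratio is small and its many terms count for little, which is the intent of the measure.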

simplifyUrl()

Converts a url with a scheme into one without. Also removes trailing slashes from url. Shortens url to desired length by inserting ellipsis for part of it if necessary


    public
            static        simplifyUrl(string $url, int $max_len) : string

Parameters

$url : string: the url to trim
$max_len : int: length to shorten url to, 0 = no shortening

Return values

string —

the trimmed url
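A Python sketch of the three steps (drop the scheme, drop trailing slashes, shorten with an ellipsis). Placing the ellipsis in the middle of the url is an assumption, since this page does not say which part is replaced; simplify_url is a hypothetical name.

```python
import re


def simplify_url(url, max_len):
    """Strip scheme and trailing slashes; shorten if max_len > 0."""
    short = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url).rstrip("/")
    if max_len > 0 and len(short) > max_len:
        # Assumption: keep the start and end, replace the middle by "...".
        keep = max(max_len - 3, 2)
        head = keep - keep // 2
        tail = keep // 2
        short = short[:head] + "..." + short[-tail:]
    return short


print(simplify_url("https://www.example.com/a/very/long/path/index.html", 20))
```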

urlMemberSiteArray()

Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.


    public
            static        urlMemberSiteArray(string $url, array<string|int, mixed> $site_array, string $name[, bool $return_rule = false ]) : mixed

Parameters

$url : string: url to check
$site_array : array<string|int, mixed>: sites to check against
$name : string: identifier to store $site_array with in this method's cache
$return_rule : bool = false: whether when a match is found to return true or to return the matching site rule

Return values

mixed —

whether the url belongs to one of the sites, or the matching site rule if $return_rule is true and a match is found
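The two rule forms can be sketched in Python as below. The sketch leaves out the cache and the $name parameter, and it assumes a domain:host rule matches the host itself as well as its subdomains; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def url_member_site_array(url, site_array, return_rule=False):
    """Return True (or the matching rule) if url belongs to a listed site."""
    host = urlsplit(url).hostname or ""
    for site in site_array:
        if site.startswith("domain:"):
            domain = site[len("domain:"):]
            # Assumption: subdomains of the given domain also match.
            matched = host == domain or host.endswith("." + domain)
        else:
            # url form: the site url must be a substring of url.
            matched = site in url
        if matched:
            return site if return_rule else True
    return False


sites = ["domain:example.com", "http://www.test.org/docs/"]
print(url_member_site_array("http://blog.example.com/post", sites))
```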
