UrlParser
Library of functions used to manipulate and to extract components from urls
Table of Contents
- canonicalLink() : string
- Given a $link that was obtained from a website $site, returns a complete URL for that link.
- checkRecursiveUrl() : bool
- Checks if a url has a repeated set of subdirectories, and if the number of repeats occurs more than some threshold number of times
- cleanRedundantLinks() : array<string|int, mixed>
- Used to delete links from array of links $links based on whether they are the same as the site they came from (or otherwise judged irrelevant)
- countCompanyLevelDomainsInCommonDetectFarm() : int
- Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations. It also tries to determine if a $url is potentially part of a link farm. To do this it checks (1) if the number of distinct, not sub-locale domains with a shared company domain is high (> $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control. For example, a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes.
- cullByDomainFilter() : bool
- Checks if a url's host is either a company level domain (a cld) or is of the form www.cld or has as its cld a domain that is in one of the supplied BloomFilterFile objects
- extractTextFromUrl() : string
- Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl() except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents.
- getBaseDomain() : string
- Gets the domain of a url less any leading www
- getCompanyLevelDomain() : string
- Calculates the company level domain for the given url
- getDocumentFilename() : string
- Gets the filename portion of a url if present; otherwise returns "Some File"
- getDocumentType() : string
- Given a url, makes a guess at the file type of the file it points to
- getFragment() : string
- Get the url fragment string component of a url
- getHost() : string|false
- Get the host name portion of a url if present; if not return false
- getHostAndPath() : array<string|int, mixed>
- Returns as a two element array the host and path of a url
- getHostPaths() : array<string|int, mixed>
- Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url
- getHostSubdomains() : array<string|int, mixed>
- Gets the subdomains of the host portion of a url.
- getLang() : string|false
- Attempts to guess the language tag based on url
- getPath() : string|null
- Get the path portion of a url if present; if not return null
- getPort() : int
- Get the port number of a url if present; if not return 80
- getQuery() : string
- Get the query string component of a url
- getScheme() : string
- Get the scheme of a url if present; if not return http
- getWordsInHostUrl() : string
- Given a url, extracts the words in the host part of the url, provided the url has no path part beyond /.
- getWordsLastPathPartUrl() : string
- Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "
- guessFileSizeFromUrl() : int
- Used to guess the file size in bytes of the file that a url is pointed at based on its file type.
- guessMimeTypeFromFileName() : string
- Guess mime type based on extension of the file
- hasHostUrl() : bool
- Checks if the url has a host part.
- isLocalhostUrl() : bool
- Checks if a $url is on localhost
- isPathMemberRegexPaths() : bool
- Checks if $path matches against any of the Robots.txt style regex paths in $paths
- isSchemeCrawlable() : bool
- Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).
- pruneLinks() : array<string|int, mixed>
- Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.
- simplifyUrl() : string
- Converts a url with a scheme into one without. Also removes trailing slashes from url. Shortens url to desired length by inserting ellipsis for part of it if necessary
- urlMemberSiteArray() : mixed
- Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.
Methods
canonicalLink()
Given a $link that was obtained from a website $site, returns a complete URL for that link.
public
static canonicalLink(string $link, string $site[, bool $no_fragment = true ]) : string
For example, the $link some_dir/test.html on the $site http://www.somewhere.com/bob would yield the complete url http://www.somewhere.com/bob/some_dir/test.html
Parameters
- $link : string
-
a relative or complete url
- $site : string
-
a base url
- $no_fragment : bool = true
-
if false and the url has a fragment (#link_within_page), the fragment is included in the result
Return values
string —a complete url based on these two pieces of information
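The resolution rule in the example above can be sketched in a few lines. This is only an illustration of the documented behaviour, not the library's implementation: the function name sketchCanonicalLink is hypothetical, and the handling of host-relative links (a leading /) is an assumption that the description does not spell out.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchCanonicalLink(string $link, string $site, bool $no_fragment = true): string
{
    if ($no_fragment) {
        // Drop "#fragment" if one is present.
        $link = explode('#', $link, 2)[0];
    }
    // A link that already carries a scheme is treated as complete.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $link)) {
        return $link;
    }
    $parts = parse_url($site);
    $root = ($parts['scheme'] ?? 'http') . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Assumption: a leading slash means "relative to the host root".
    if (strncmp($link, '/', 1) === 0) {
        return $root . $link;
    }
    // Otherwise treat $site as a directory and append the link to it,
    // as in the documented example.
    $base_path = rtrim($parts['path'] ?? '', '/');
    return $root . $base_path . '/' . $link;
}

// Prints http://www.somewhere.com/bob/some_dir/test.html
echo sketchCanonicalLink('some_dir/test.html', 'http://www.somewhere.com/bob'), "\n";
```

Note that the sketch follows the documented example and appends to the last path segment of $site; it does not attempt dot-segment (../) handling.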
checkRecursiveUrl()
Checks if a url has a repeated set of subdirectories, and if the number of repeats occurs more than some threshold number of times
public
static checkRecursiveUrl(string $url[, int $repeat_threshold = 3 ]) : bool
A pattern like bob/.../bob counts as one repetition. bob/.../alice/.../bob/.../alice would count as two (... should be read as an ellipsis, not a directory name). If the threshold is three and there are at least three repeated matches, this function returns true; it returns false otherwise.
Parameters
- $url : string
-
the url to check
- $repeat_threshold : int = 3
-
the number of repeats of a subdir name to trigger a true response
Return values
bool —whether a repeated subdirectory name with more matches than the threshold was found
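One possible reading of the counting rule above is sketched here. It assumes that every occurrence of a directory name after its first counts as one repetition and that the check succeeds once the total reaches the threshold; the function name is hypothetical and the real method may count differently.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchCheckRecursiveUrl(string $url, int $repeat_threshold = 3): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    // Split the path into its non-empty directory names.
    $segments = array_filter(explode('/', $path), 'strlen');
    $repeats = 0;
    foreach (array_count_values($segments) as $count) {
        // Assumption: each occurrence after the first is one repetition.
        $repeats += $count - 1;
    }
    return $repeats >= $repeat_threshold;
}

// bob and alice each repeat once: two repetitions, below the default threshold.
var_dump(sketchCheckRecursiveUrl('http://example.com/bob/alice/bob/alice'));
```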
cleanRedundantLinks()
Used to delete links from array of links $links based on whether they are the same as the site they came from (or otherwise judged irrelevant)
public
static cleanRedundantLinks(array<string|int, mixed> $links, string $parent_url) : array<string|int, mixed>
Parameters
- $links : array<string|int, mixed>
-
pairs of the form $link => $link_info
- $parent_url : string
-
a site that the links were found on
Return values
array<string|int, mixed> —just those links which pass the relevancy test
countCompanyLevelDomainsInCommonDetectFarm()
Returns the number of links in the array $links which share the same company level domain (cld) as $url. For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations. It also tries to determine if a $url is potentially part of a link farm. To do this it checks (1) if the number of distinct, not sub-locale domains with a shared company domain is high (> $threshold/2). This suggests a lot of bogus outgoing links that are all under one company's control. For example, a site www.foo.com linking to md5_hash.foo.com for many different md5 hashes.
public
static countCompanyLevelDomainsInCommonDetectFarm(string $url, array<string|int, mixed> $links[, int $threshold = 200 ]) : int
If this is detected this method returns -1. This method also returns -1 if (2) there seem to be lots of links ($threshold) from the current domain to a single domain that shares the same company domain. This might indicate a domain md5_hash.foo.com with lots of links to a domain www.foo.com
Parameters
- $url : string
-
the url to compare against $links
- $links : array<string|int, mixed>
-
an array of urls
- $threshold : int = 200
-
number above which if either situation (1) or (2) above happens then deem site spam
Return values
int —the number of times $url shares the cld with a link in $links, or -1 if $url appears to be part of a link farm
cullByDomainFilter()
Checks if a url's host is either a company level domain (a cld) or is of the form www.cld or has as its cld a domain that is in one of the supplied BloomFilterFile objects
public
static cullByDomainFilter(string $url, array<string|int, mixed> $filters) : bool
Parameters
- $url : string
-
url to check if in above form
- $filters : array<string|int, mixed>
-
array of BloomFilterFile objects
Return values
bool —whether or not url has above form
extractTextFromUrl()
Extracts text from a url. Similar to getWordsInHostUrl() and getWordsLastPathPartUrl() except that it operates on the whole url. This function is mainly used on link documents; the other two are mainly used with standard documents.
public
static extractTextFromUrl(string $url) : string
Parameters
- $url : string
-
to find text that might say what link is about
Return values
string —heuristically derived text.
getBaseDomain()
Gets the domain of a url less any leading www
public
static getBaseDomain(string $url) : string
Parameters
- $url : string
-
to get domain of
Return values
string —the base domain as defined above
getCompanyLevelDomain()
Calculates the company level domain for the given url
public
static getCompanyLevelDomain(string $url) : string
For www.yahoo.com the cld is yahoo.com, for www.theregister.co.uk it is theregister.co.uk. It is similar for organizations.
Parameters
- $url : string
-
url to determine cld for
Return values
string —the cld of $url
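A rough sketch of how the two examples above could be computed follows. The short list of second-level labels (co, com, org, and so on) and the two-letter country-code test are assumptions made for illustration; the function name is hypothetical and the library may use a different rule or a fuller suffix list.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchCompanyLevelDomain(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return '';
    }
    $labels = explode('.', strtolower($host));
    $n = count($labels);
    if ($n <= 2) {
        return implode('.', $labels);
    }
    // Assumed, deliberately small list of labels that appear directly
    // under a country-code top level domain (as in co.uk).
    $second_level = ['co', 'com', 'org', 'net', 'ac', 'gov', 'edu'];
    $is_country_tld = strlen($labels[$n - 1]) === 2;
    $keep = ($is_country_tld && in_array($labels[$n - 2], $second_level, true)) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

echo sketchCompanyLevelDomain('http://www.yahoo.com/'), "\n";          // yahoo.com
echo sketchCompanyLevelDomain('http://www.theregister.co.uk/'), "\n";  // theregister.co.uk
```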
getDocumentFilename()
Gets the filename portion of a url if present; otherwise returns "Some File"
public
static getDocumentFilename(string $url) : string
Parameters
- $url : string
-
a url to parse
Return values
string —the filename portion of this url
getDocumentType()
Given a url, makes a guess at the file type of the file it points to
public
static getDocumentType(string $url[, string $default = "html" ]) : string
Parameters
- $url : string
-
a url to figure out the file type for
- $default : string = "html"
-
default type to be returned in the case that document type cannot be determined from the url, defaults to html
Return values
string —the guessed file type.
getFragment()
Get the url fragment string component of a url
public
static getFragment(string $url) : string
Parameters
- $url : string
-
a url to get the url fragment string out of
Return values
string —the url fragment string if present; null otherwise
getHost()
Get the host name portion of a url if present; if not return false
public
static getHost(string $url[, bool $with_login_and_port = true ]) : string|false
Parameters
- $url : string
-
the url to parse
- $with_login_and_port : bool = true
-
whether to include user, password, and port if present
Return values
string|false —the host portion of the url if present; false otherwise
getHostAndPath()
Returns as a two element array the host and path of a url
public
static getHostAndPath(string $url[, bool $with_login_and_port = true ][, bool $with_query_string = false ]) : array<string|int, mixed>
Parameters
- $url : string
-
initial url to get host and path of
- $with_login_and_port : bool = true
-
controls whether the host should contain login and port info
- $with_query_string : bool = false
-
says whether the path should contain the query string as well
Return values
array<string|int, mixed> —host and the path as a pair
getHostPaths()
Gets an array of prefix urls from a given url. Each prefix contains at least the hostname of the start url
public
static getHostPaths(string $url) : array<string|int, mixed>
For example, http://host.com/b/c/ would yield http://host.com/, http://host.com/b, http://host.com/b/, http://host.com/b/c, http://host.com/b/c/
Parameters
- $url : string
-
the url to extract prefixes from
Return values
array<string|int, mixed> —the array of url prefixes
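The prefix expansion described for http://host.com/b/c/ can be sketched as below. The function name is hypothetical, and the treatment of a url without a trailing slash (the final "directory" form is dropped) is an assumption, since the documentation only shows the trailing-slash case.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchHostPaths(string $url): array
{
    $parts = parse_url($url);
    if (empty($parts['host'])) {
        return [];
    }
    $current = ($parts['scheme'] ?? 'http') . '://' . $parts['host'];
    $path = $parts['path'] ?? '/';
    // Every result contains at least the host.
    $out = [$current . '/'];
    foreach (explode('/', trim($path, '/')) as $segment) {
        if ($segment === '') {
            continue;
        }
        $current .= '/' . $segment;
        // Emit both the "file" and the "directory" form of each prefix.
        $out[] = $current;
        $out[] = $current . '/';
    }
    // Assumption: without a trailing slash the last directory form
    // is not a prefix of the url, so it is dropped.
    if (count($out) > 1 && substr($path, -1) !== '/') {
        array_pop($out);
    }
    return $out;
}

print_r(sketchHostPaths('http://host.com/b/c/'));
```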
getHostSubdomains()
Gets the subdomains of the host portion of a url.
public
static getHostSubdomains(string $url) : array<string|int, mixed>
For example, http://a.b.c/d/f/ will return a.b.c, .a.b.c, b.c, .b.c, c, .c
Parameters
- $url : string
-
the url to extract subdomains from
Return values
array<string|int, mixed> —the array of subdomains
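The a.b.c example can be reproduced with a short loop that strips one leading label at a time and emits each suffix with and without a leading dot. The function name is hypothetical and this is only a sketch of the documented output, not the library's code.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchHostSubdomains(string $url): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return [];
    }
    $labels = explode('.', $host);
    $out = [];
    while ($labels) {
        $suffix = implode('.', $labels);
        $out[] = $suffix;        // e.g. b.c
        $out[] = '.' . $suffix;  // e.g. .b.c
        array_shift($labels);    // drop the leftmost label
    }
    return $out;
}

// Prints a.b.c, .a.b.c, b.c, .b.c, c, .c
echo implode(', ', sketchHostSubdomains('http://a.b.c/d/f/')), "\n";
```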
getLang()
Attempts to guess the language tag based on url
public
static getLang(string $url) : string|false
Parameters
- $url : string
-
the url to parse
Return values
string|false —the language tag guessed from the url (for example, from its top level domain) if one can be determined; false otherwise
getPath()
Get the path portion of a url if present; if not return null
public
static getPath(string $url[, bool $with_query_string = false ]) : string|null
Parameters
- $url : string
-
the url to parse
- $with_query_string : bool = false
-
whether to also include the query string at the end of the path
Return values
string|null —the path portion of the url if present; null otherwise
getPort()
Get the port number of a url if present; if not return 80
public
static getPort(string $url) : int
Parameters
- $url : string
-
the url to extract port number from
Return values
int —a port number
getQuery()
Get the query string component of a url
public
static getQuery(string $url) : string
Parameters
- $url : string
-
a url to get the query string out of
Return values
string —the query string if present; null otherwise
getScheme()
Get the scheme of a url if present; if not return http
public
static getScheme(string $url) : string
Parameters
- $url : string
-
the url to extract scheme from
Return values
string —the scheme of the url, or http if none is present
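The fallback behaviour documented for getScheme() and getPort() (http and 80 when the component is absent) can be illustrated with PHP's built-in parse_url(). The function name below is hypothetical, and lowercasing the scheme is an assumption; the sketch mirrors the documented defaults, so a url with an https scheme and no explicit port still yields 80.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchSchemeAndPort(string $url): array
{
    // parse_url() returns null for a missing component and false
    // for a seriously malformed url; both fall back to the default.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    $port = parse_url($url, PHP_URL_PORT);
    return [
        is_string($scheme) ? strtolower($scheme) : 'http',
        is_int($port) ? $port : 80,
    ];
}

[$scheme, $port] = sketchSchemeAndPort('https://example.com:8443/x');
echo $scheme, ' ', $port, "\n"; // https 8443
```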
getWordsInHostUrl()
Given a url, extracts the words in the host part of the url, provided the url has no path part beyond /.
public
static getWordsInHostUrl(string $url) : string
Ignores a leading www and also ignores the tld.
For example, "http://www.yahoo.com/" returns " yahoo "
Parameters
- $url : string
-
a url to extract words from
Return values
string —space separated words extracted.
getWordsLastPathPartUrl()
Given a url, extracts the words in the last path part of the url. For example, http://us3.php.net/manual/en/function.array-filter.php yields " function array filter "
public
static getWordsLastPathPartUrl(string $url) : string
Parameters
- $url : string
-
a url to extract words from
Return values
string —space separated words extracted.
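The php.net example above can be reproduced with a small sketch: take the last path component, drop its extension, and turn every run of non-alphanumeric characters into a space. The function name and the exact splitting rule are assumptions made for illustration; the library may tokenize differently.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchWordsLastPathPart(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    $last = basename($path);
    // Drop a trailing file extension such as ".php".
    $last = preg_replace('/\.[A-Za-z0-9]+$/', '', $last);
    // Assumption: any run of non-alphanumeric characters separates words.
    $words = trim(preg_replace('/[^A-Za-z0-9]+/', ' ', $last));
    // Pad with spaces, matching the format of the documented example.
    return $words === '' ? '' : ' ' . $words . ' ';
}

// Prints [ function array filter ]
echo '[', sketchWordsLastPathPart('http://us3.php.net/manual/en/function.array-filter.php'), "]\n";
```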
guessFileSizeFromUrl()
Used to guess the file size in bytes of the file that a url is pointed at based on its file type.
public
static guessFileSizeFromUrl(string $url) : int
Parameters
- $url : string
-
to estimate the size of
Return values
int —estimated number of bytes
guessMimeTypeFromFileName()
Guess mime type based on extension of the file
public
static guessMimeTypeFromFileName(string $file_name[, string $default = 'text/plain' ]) : string
Parameters
- $file_name : string
-
name of the file
- $default : string = 'text/plain'
-
what mime type to return if mime type couldn't be determined
Return values
string —$mime_type for the given file name
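An extension lookup with a fallback, in the spirit of the description above, might look like the following. The function name and the small extension table are assumptions for illustration; the mapping the library actually uses is not shown in this documentation.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchGuessMimeType(string $file_name, string $default = 'text/plain'): string
{
    // Assumed, deliberately short table of common extensions.
    $known = [
        'html' => 'text/html',
        'htm' => 'text/html',
        'css' => 'text/css',
        'json' => 'application/json',
        'pdf' => 'application/pdf',
        'png' => 'image/png',
        'jpg' => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'gif' => 'image/gif',
    ];
    // pathinfo() yields an empty string when there is no extension.
    $extension = strtolower(pathinfo($file_name, PATHINFO_EXTENSION));
    return $known[$extension] ?? $default;
}

echo sketchGuessMimeType('report.PDF'), "\n"; // application/pdf
echo sketchGuessMimeType('README'), "\n";     // text/plain
```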
hasHostUrl()
Checks if the url has a host part.
public
static hasHostUrl(string $url) : bool
Parameters
- $url : string
-
the url to check
Return values
bool —true if it does; false otherwise
isLocalhostUrl()
Checks if a $url is on localhost
public
static isLocalhostUrl(string $url) : bool
Parameters
- $url : string
-
the url to check
Return values
bool —whether or not it is on localhost
isPathMemberRegexPaths()
Checks if $path matches against any of the Robots.txt style regex paths in $paths
public
static isPathMemberRegexPaths(string $path, array<string|int, mixed> $robot_paths) : bool
Parameters
- $path : string
-
a path component of a url
- $robot_paths : array<string|int, mixed>
-
in format of robots.txt regex paths
Return values
bool —whether it is a member or not
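A minimal sketch of robots.txt style matching is given below. It assumes the common conventions that a rule matches as a path prefix, that * matches any run of characters, and that a trailing $ anchors the end of the path; the function name is hypothetical and the library's matching rules may differ in detail.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchPathMatchesRobotPaths(string $path, array $robot_paths): bool
{
    foreach ($robot_paths as $robot_path) {
        if ($robot_path === '') {
            continue;
        }
        // A trailing "$" anchors the rule to the end of the path.
        $anchored = substr($robot_path, -1) === '$';
        if ($anchored) {
            $robot_path = substr($robot_path, 0, -1);
        }
        // Quote everything, then turn each quoted "*" back into ".*".
        $regex = str_replace('\*', '.*', preg_quote($robot_path, '#'));
        $regex = '#^' . $regex . ($anchored ? '$' : '') . '#';
        if (preg_match($regex, $path) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(sketchPathMatchesRobotPaths('/private/x.html', ['/private/'])); // bool(true)
```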
isSchemeCrawlable()
Checks if the url scheme is either http, https, or gopher (old protocol but somewhat geeky-cool to still support).
public
static isSchemeCrawlable(string $url) : bool
Parameters
- $url : string
-
the url to check
Return values
bool —returns true if it is either http, https, or gopher and false otherwise
pruneLinks()
Prunes a list of url => text pairs down to $max_links many pairs by choosing those whose text has the most information. Information is crudely measured by the effective number of terms in the text.
public
static pruneLinks(array<string|int, mixed> $links[, int $max_links = CMAX_LINKS_TO_EXTRACT ]) : array<string|int, mixed>
To compute this, we count the number of terms by splitting on white space. We then multiply this by the ratio of the compressed length of the text divided by its uncompressed length.
Parameters
- $links : array<string|int, mixed>
-
list of pairs $url=>$text
- $max_links : int = CMAX_LINKS_TO_EXTRACT
-
maximum number of links from $links to return
Return values
array<string|int, mixed> —$out_links extracted from $links according to the description above.
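The weighting described above (term count multiplied by the ratio of compressed to uncompressed length) can be sketched as follows. The function name, the default of 50, and the use of gzcompress() from PHP's zlib extension as the compressor are assumptions for illustration; the documentation does not say which compressor the library uses or what the default constant evaluates to.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
// Requires the zlib extension for gzcompress().
function sketchPruneLinks(array $links, int $max_links = 50): array
{
    if (count($links) <= $max_links) {
        return $links;
    }
    $weights = [];
    foreach ($links as $url => $text) {
        $text = (string) $text;
        $length = strlen($text);
        if ($length === 0) {
            $weights[$url] = 0.0;
            continue;
        }
        // Number of terms, splitting on white space.
        $num_terms = count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
        // Repetitive text compresses well, which lowers its weight.
        $ratio = strlen(gzcompress($text)) / $length;
        $weights[$url] = $num_terms * $ratio;
    }
    arsort($weights);
    $keep = array_slice(array_keys($weights), 0, $max_links);
    // Preserve the original order of the surviving links.
    return array_intersect_key($links, array_flip($keep));
}

$links = [
    'http://example.com/a' => str_repeat('click ', 50),
    'http://example.com/b' => 'detailed guide to parsing urls in php with examples and tests',
];
print_r(array_keys(sketchPruneLinks($links, 1)));
```

With this measure, link text that repeats the same word many times scores lower than shorter but more varied text, which is the intent of the "effective number of terms" description.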
simplifyUrl()
Converts a url with a scheme into one without. Also removes trailing slashes from url. Shortens url to desired length by inserting ellipsis for part of it if necessary
public
static simplifyUrl(string $url, int $max_len) : string
Parameters
- $url : string
-
the url to trim
- $max_len : int
-
length to shorten url to, 0 = no shortening
Return values
string —the trimmed url
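The three steps in the description (drop the scheme, drop trailing slashes, shorten with an ellipsis) could be sketched like this. The function name is hypothetical, and where the ellipsis goes is an assumption: the sketch keeps the start and the end of the url and replaces the middle with three dots.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
function sketchSimplifyUrl(string $url, int $max_len): string
{
    // Remove a leading "scheme://" and any trailing slashes.
    $url = preg_replace('/^[a-z][a-z0-9+.\-]*:\/\//i', '', $url);
    $url = rtrim($url, '/');
    // 0 means no shortening; very small limits are ignored here because
    // they leave no room for text on both sides of the ellipsis.
    if ($max_len < 5 || strlen($url) <= $max_len) {
        return $url;
    }
    // Assumption: keep the head and tail, elide the middle.
    $keep = $max_len - 3;
    $head = intdiv($keep + 1, 2);
    $tail = $keep - $head;
    return substr($url, 0, $head) . '...' . substr($url, -$tail);
}

echo sketchSimplifyUrl('http://www.example.com/', 0), "\n"; // www.example.com
echo sketchSimplifyUrl('https://www.example.com/a/very/long/path/index.html', 20), "\n";
```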
urlMemberSiteArray()
Checks if the url belongs to one of the sites listed in $site_array. Sites can be given either in the form domain:host or in the form of a url, in which case it is checked that the site url is a substring of the passed url.
public
static urlMemberSiteArray(string $url, array<string|int, mixed> $site_array, string $name[, bool $return_rule = false ]) : mixed
Parameters
- $url : string
-
url to check
- $site_array : array<string|int, mixed>
-
sites to check against
- $name : string
-
identifier to store $site_array with in this function's cache
- $return_rule : bool = false
-
whether when a match is found to return true or to return the matching site rule
Return values
mixed —whether the url belongs to one of the sites
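The two rule forms described above can be sketched without the caching that the $name parameter controls. The function name is hypothetical, and letting a domain: rule also match subdomains of the named host is an assumption; the documentation only states the domain:host form itself.

```php
<?php
// Hypothetical helper for illustration; not part of UrlParser.
// The cache keyed by $name in the documented signature is omitted.
function sketchUrlMemberSiteArray(string $url, array $site_array, bool $return_rule = false)
{
    $host = parse_url($url, PHP_URL_HOST);
    $host = is_string($host) ? strtolower($host) : '';
    foreach ($site_array as $site) {
        if (strncmp($site, 'domain:', 7) === 0) {
            $domain = strtolower(substr($site, 7));
            // Assumption: match the host itself or any subdomain of it.
            $matches = $domain !== '' && ($host === $domain ||
                substr($host, -strlen($domain) - 1) === '.' . $domain);
        } else {
            // Url form: the site url must be a substring of $url.
            $matches = $site !== '' && strpos($url, $site) !== false;
        }
        if ($matches) {
            return $return_rule ? $site : true;
        }
    }
    return false;
}

var_dump(sketchUrlMemberSiteArray('http://news.example.com/a', ['domain:example.com']));
```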