Yioop_V9.5_Source_Code_Documentation

FetchUrl
in package
implements CrawlConstants

Code used to manage HTTP or Gopher requests for one or more URLs

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$local_ip_cache  : array<string|int, mixed>
a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster
checkResponseForErrors()  : mixed
Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog
computePageHash()  : string
Computes a hash of a string containing page data for use in deduplication of pages with similar content
getCurlIp()  : string
Computes the IP address from an HTTP get-response header
getPage()  : string
Make a curl request for the provided url
getPages()  : array<string|int, mixed>
Make multi_curl requests for an array of sites with urls or onion urls
parseHeaderPage()  : array<string|int, mixed>
Splits an http response document into the http headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.
prepareUrlHeaders()  : array<string|int, mixed>
Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookups, computes the url, headers, dns resolve string, and referer needed to make a curl request for that url.

Properties

$local_ip_cache

a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster

public static array<string|int, mixed> $local_ip_cache = []

Methods

checkResponseForErrors()

Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog

public static checkResponseForErrors(string $response) : mixed
Parameters
$response : string

getPage response in which to check for errors

Return values
mixed
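
A minimal usage sketch (the url is illustrative, and the fully qualified class name assumes Yioop's library namespace):

    use seekquarry\yioop\library\FetchUrl; // namespace assumed

    // Fetch a page, then send $response to the crawlLog if it contains
    // NOTICE, WARNING, or FATAL
    $response = FetchUrl::getPage("https://www.example.com/some_page.php");
    FetchUrl::checkResponseForErrors($response);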

computePageHash()

Computes a hash of a string containing page data for use in deduplication of pages with similar content

public static computePageHash(string $page[, string $type = "not-text" ]) : string
Parameters
$page : string

reference to web page data

$type : string = "not-text"

for now either "text" or "not-text". If "text", then some tags are removed before computing the hash

Return values
string

8 byte hash to identify page contents
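
A sketch of using the hash for deduplication, assuming $page1 and $page2 already hold downloaded page data:

    // Pages with essentially the same text content yield the same 8 byte hash
    $hash1 = FetchUrl::computePageHash($page1, "text"); // "text": strip some tags first
    $hash2 = FetchUrl::computePageHash($page2, "text");
    $is_near_duplicate = ($hash1 === $hash2);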

getCurlIp()

Computes the IP address from an HTTP get-response header

public static getCurlIp(string $header) : string
Parameters
$header : string

contains complete transcript of HTTP get/response

Return values
string

IPv4 address as a string of dot-separated quads.
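
A sketch, assuming $transcript holds the complete transcript of an HTTP get/response produced during a curl download:

    // Pull the IPv4 address of the remote server out of the transcript
    $ip = FetchUrl::getCurlIp($transcript);
    if ($ip != "") {
        echo "Page was downloaded from IP: $ip\n"; // report the parsed address
    }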

getPage()

Make a curl request for the provided url

public static getPage(string $site[, array<string|int, mixed> $post_data = null ][, bool $check_for_errors = false ][, string $user_password = null ][,  $timeout = CSINGLE_PAGE_TIMEOUT ]) : string
Parameters
$site : string

url of page to request

$post_data : array<string|int, mixed> = null

any data to be POST'd to the URL

$check_for_errors : bool = false

whether or not to check the response for the words NOTICE, WARNING, FATAL, which might indicate an error on the server

$user_password : string = null

username:password to use for connection if needed (optional)

$timeout : = CSINGLE_PAGE_TIMEOUT

how long to wait for page download to complete

Return values
string

the contents of what the curl request fetched
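
A sketch of a POST request with error checking (the url and form fields are hypothetical):

    $post_data = ["u" => "admin", "p" => "secret"]; // hypothetical form fields
    // Third argument true: also scan the response for NOTICE, WARNING, FATAL
    // and send it to the crawlLog if any are found
    $contents = FetchUrl::getPage("https://www.example.com/login.php",
        $post_data, true);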

getPages()

Make multi_curl requests for an array of sites with urls or onion urls

public static getPages(array<string|int, mixed> $sites[, bool $timer = false ][, int $page_range_request = CPAGE_RANGE_REQUEST ][, string $temp_dir = "" ][, string $key = CrawlConstants::URL ][, string $value = CrawlConstants::PAGE ][, bool $minimal = false ][, array<string|int, mixed> $post_data = null ][, bool $follow = false ][, string $tor_proxy = "" ][, array<string|int, mixed> $proxy_servers = [] ]) : array<string|int, mixed>
Parameters
$sites : array<string|int, mixed>

an array containing urls of pages to request

$timer : bool = false

flag, true means print timing statistics to log

$page_range_request : int = CPAGE_RANGE_REQUEST

maximum number of bytes to download per page; 0 means download all

$temp_dir : string = ""

folder to store temporary ip header info

$key : string = CrawlConstants::URL

the component of $sites[$i] that holds the url to get; defaults to CrawlConstants::URL

$value : string = CrawlConstants::PAGE

component of $sites[$i] in which to store the page that was gotten

$minimal : bool = false

if true, do a faster request of pages by skipping steps such as extracting the HTTP headers sent, etc. Also will not send the Expect header

$post_data : array<string|int, mixed> = null

data to be POST'd to each site

$follow : bool = false

whether to follow redirects or not

$tor_proxy : string = ""

url of a proxy that knows how to download .onion urls

$proxy_servers : array<string|int, mixed> = []

if not [], then an array of proxy servers to use rather than downloading web pages directly from the current machine

Return values
array<string|int, mixed>

an updated array with the contents of those pages
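
A sketch of a small batch download (the urls are illustrative; assumes FetchUrl and CrawlConstants are imported from Yioop's library namespace):

    $sites = [
        [CrawlConstants::URL => "https://www.example.com/"],
        [CrawlConstants::URL => "https://www.example.org/robots.txt"],
    ];
    // Second argument true: print timing statistics to the log
    $sites = FetchUrl::getPages($sites, true);
    // Each downloaded page is stored back into $sites[$i] under the $value key,
    // CrawlConstants::PAGE by default
    $first_page = $sites[0][CrawlConstants::PAGE] ?? "";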

parseHeaderPage()

Splits an http response document into the http headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.

public static parseHeaderPage(string $header_and_page[, string $value = CrawlConstants::PAGE ]) : array<string|int, mixed>
Parameters
$header_and_page : string

string of downloaded data

$value : string = CrawlConstants::PAGE

field in which to store the page portion of the response

Return values
array<string|int, mixed>

info array consisting of the header and page of an http response, as well as the server, server version, operating system, encoding, and date information parsed from the header.
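
A sketch, assuming $header_and_page holds the raw headers-plus-body string from a download:

    $site_info = FetchUrl::parseHeaderPage($header_and_page);
    // The page body is stored under the $value key, CrawlConstants::PAGE by default;
    // the server, encoding, and date details parsed from the header appear under
    // other CrawlConstants fields of the returned array
    $body = $site_info[CrawlConstants::PAGE] ?? "";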

prepareUrlHeaders()

Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookups, computes the url, headers, dns resolve string, and referer needed to make a curl request for that url.

public static prepareUrlHeaders(string $url[, array<string|int, mixed> $proxy_servers = [] ][, string $temp_dir = "" ][, bool $expect_header = true ]) : array<string|int, mixed>

Returns a modified url, an array of headers for the etag and to keep lighttpd happy, a dns resolve string used by curl for IP caching, and a separated-out referer string. If some of the fields in the extended url are not present, such as the ip, it attempts to compute them by looking in a cache in temp_dir. For other values, such as referer, if not present, an empty string is returned, and for Etag the header is simply not added to the array of headers returned.

Parameters
$url : string

site to download, potentially with an ip address appended after ###

$proxy_servers : array<string|int, mixed> = []

if not empty, an array of proxy servers to crawl through

$temp_dir : string = ""

folder to store temporary ip header info

$expect_header : bool = true

whether to send the Expect: 100-continue header

Return values
array<string|int, mixed>

5-tuple (orig url, url with replacement, http header array, colon-separated string host:port:ip -- used for curl dns caching, referer)
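
A sketch with a hypothetical extended url (the ip address and referer shown are illustrative):

    $extended_url = "https://www.example.com/a.html###192.0.2.1###" .
        "https://www.example.com/###";
    list($original_url, $url, $http_headers, $dns_resolve, $referer) =
        FetchUrl::prepareUrlHeaders($extended_url);
    // $http_headers can be supplied to curl via CURLOPT_HTTPHEADER, and the
    // host:port:ip string in $dns_resolve is intended for curl dns caching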

