Yioop_V9.5_Source_Code_Documentation

FetchUrl
in package
implements CrawlConstants

Code used to manage HTTP or Gopher requests for one or more URLs

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$local_ip_cache  : array<string|int, mixed>
a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster
checkResponseForErrors()  : mixed
Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog
computePageHash()  : string
Computes a hash of a string containing page data for use in deduplication of pages with similar content
getCurlIp()  : string
Computes the IP address from an HTTP get-response header
getPage()  : string
Make a curl request for the provided url
getPages()  : array<string|int, mixed>
Make multi_curl requests for an array of sites with urls or onion urls
parseHeaderPage()  : array<string|int, mixed>
Splits an http response document into the http headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.
prepareUrlHeaders()  : array<string|int, mixed>
Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookups, computes the url, headers, dns resolve string, and referer needed to make a curl request for that url.

Properties

$local_ip_cache

a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster

public static array<string|int, mixed> $local_ip_cache = []

Methods

checkResponseForErrors()

Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog

public static checkResponseForErrors(string $response) : mixed
Parameters
$response : string

getPage response in which to check for errors

Return values
mixed
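
A minimal usage sketch (the url is illustrative, and the fully qualified class name assumes Yioop's library namespace):

    use seekquarry\yioop\library\FetchUrl; // namespace assumed

    // Fetch a page, then send $response to the crawlLog if it contains
    // NOTICE, WARNING, or FATAL
    $response = FetchUrl::getPage("https://www.example.com/some_page.php");
    FetchUrl::checkResponseForErrors($response);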

computePageHash()

Computes a hash of a string containing page data for use in deduplication of pages with similar content

public static computePageHash(string $page[, string $type = "not-text" ]) : string
Parameters
$page : string

reference to web page data

$type : string = "not-text"

for now either "text" or "not-text". If "text", then some tags are removed before computing the hash

Return values
string

8 byte hash to identify page contents
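
A sketch of using the hash for deduplication, assuming $page1 and $page2 already hold downloaded page data:

    // Pages with essentially the same text content yield the same 8 byte hash
    $hash1 = FetchUrl::computePageHash($page1, "text"); // "text": strip some tags first
    $hash2 = FetchUrl::computePageHash($page2, "text");
    $is_near_duplicate = ($hash1 === $hash2);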

getCurlIp()

Computes the IP address from an HTTP get-response header

public static getCurlIp(string $header) : string
Parameters
$header : string

contains complete transcript of HTTP get/response

Return values
string

IPv4 address as a string of dot-separated quads.
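
A sketch, assuming $transcript holds the complete transcript of an HTTP get/response produced during a curl download:

    // Pull the IPv4 address of the remote server out of the transcript
    $ip = FetchUrl::getCurlIp($transcript);
    if ($ip != "") {
        echo "Page was downloaded from IP: $ip\n"; // report the parsed address
    }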

getPage()

Make a curl request for the provided url

public static getPage(string $site[, array<string|int, mixed> $post_data = null ][, bool $check_for_errors = false ][, string $user_password = null ][,  $timeout = CSINGLE_PAGE_TIMEOUT ]) : string
Parameters
$site : string

url of page to request

$post_data : array<string|int, mixed> = null

any data to be POST'd to the URL

$check_for_errors : bool = false

whether or not to check the response for the words NOTICE, WARNING, FATAL, which might indicate an error on the server

$user_password : string = null

username:password to use for connection if needed (optional)

$timeout : = CSINGLE_PAGE_TIMEOUT

how long to wait for page download to complete

Return values
string

the contents of what the curl request fetched
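
A sketch of a POST request with error checking (the url and form fields are hypothetical):

    $post_data = ["u" => "admin", "p" => "secret"]; // hypothetical form fields
    // Third argument true: also scan the response for NOTICE, WARNING, FATAL
    // and send it to the crawlLog if any are found
    $contents = FetchUrl::getPage("https://www.example.com/login.php",
        $post_data, true);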

getPages()

Make multi_curl requests for an array of sites with urls or onion urls

public static getPages(array<string|int, mixed> $sites[, bool $timer = false ][, int $page_range_request = CPAGE_RANGE_REQUEST ][, string $temp_dir = "" ][, string $key = CrawlConstants::URL ][, string $value = CrawlConstants::PAGE ][, bool $minimal = false ][, array<string|int, mixed> $post_data = null ][, bool $follow = false ][, string $tor_proxy = "" ][, array<string|int, mixed> $proxy_servers = [] ]) : array<string|int, mixed>
Parameters
$sites : array<string|int, mixed>

an array containing urls of pages to request

$timer : bool = false

flag, true means print timing statistics to log

$page_range_request : int = CPAGE_RANGE_REQUEST

maximum number of bytes to download per page; 0 means download all

$temp_dir : string = ""

folder to store temporary ip header info

$key : string = CrawlConstants::URL

the component of $sites[$i] that holds the url to get; defaults to CrawlConstants::URL

$value : string = CrawlConstants::PAGE

component of $sites[$i] in which to store the page that was gotten

$minimal : bool = false

if true, do a faster request of pages by skipping steps such as extracting the HTTP headers sent, etc. Also will not send the Expect header

$post_data : array<string|int, mixed> = null

data to be POST'd to each site

$follow : bool = false

whether to follow redirects or not

$tor_proxy : string = ""

url of a proxy that knows how to download .onion urls

$proxy_servers : array<string|int, mixed> = []

if not [], then an array of proxy servers to use rather than downloading web pages directly from the current machine

Return values
array<string|int, mixed>

an updated array with the contents of those pages
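
A sketch of a small batch download (the urls are illustrative; assumes FetchUrl and CrawlConstants are imported from Yioop's library namespace):

    $sites = [
        [CrawlConstants::URL => "https://www.example.com/"],
        [CrawlConstants::URL => "https://www.example.org/robots.txt"],
    ];
    // Second argument true: print timing statistics to the log
    $sites = FetchUrl::getPages($sites, true);
    // Each downloaded page is stored back into $sites[$i] under the $value key,
    // CrawlConstants::PAGE by default
    $first_page = $sites[0][CrawlConstants::PAGE] ?? "";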

parseHeaderPage()

Splits an http response document into the http headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.

public static parseHeaderPage(string $header_and_page[, string $value = CrawlConstants::PAGE ]) : array<string|int, mixed>
Parameters
$header_and_page : string

string of downloaded data

$value : string = CrawlConstants::PAGE

field in which to store the page portion of the response

Return values
array<string|int, mixed>

info array consisting of the header and page of an http response, as well as the server, server version, operating system, encoding, and date information parsed from the header.
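
A sketch, assuming $header_and_page holds the raw headers-plus-body string from a download:

    $site_info = FetchUrl::parseHeaderPage($header_and_page);
    // The page body is stored under the $value key, CrawlConstants::PAGE by default;
    // the server, encoding, and date details parsed from the header appear under
    // other CrawlConstants fields of the returned array
    $body = $site_info[CrawlConstants::PAGE] ?? "";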

prepareUrlHeaders()

Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookups, computes the url, headers, dns resolve string, and referer needed to make a curl request for that url.

public static prepareUrlHeaders(string $url[, array<string|int, mixed> $proxy_servers = [] ][, string $temp_dir = "" ][, bool $expect_header = true ]) : array<string|int, mixed>

Returns a modified url, an array of headers for the etag and to keep lighttpd happy, a dns resolve string used by curl for IP caching, and a separated-out referer string. If some of the fields in the extended url are not present, such as the ip, it attempts to compute them by looking in a cache in temp_dir. For other values, such as referer, if not present, an empty string is returned, and for Etag the header is simply not added to the array of headers returned.

Parameters
$url : string

site to download, potentially with an ip address appended after ###

$proxy_servers : array<string|int, mixed> = []

if not empty, an array of proxy servers to crawl through

$temp_dir : string = ""

folder to store temporary ip header info

$expect_header : bool = true

whether to send the Expect: 100-continue header

Return values
array<string|int, mixed>

5-tuple (orig url, url with replacement, http header array, colon-separated string host:port:ip -- used for curl dns caching, referer)
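
A sketch with a hypothetical extended url (the ip address and referer shown are illustrative):

    $extended_url = "https://www.example.com/a.html###192.0.2.1###" .
        "https://www.example.com/###";
    list($original_url, $url, $http_headers, $dns_resolve, $referer) =
        FetchUrl::prepareUrlHeaders($extended_url);
    // $http_headers can be supplied to curl via CURLOPT_HTTPHEADER, and the
    // host:port:ip string in $dns_resolve is intended for curl dns caching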

