FetchUrl
in package
implements CrawlConstants
Code used to manage HTTP or Gopher requests from one or more URLs
Tags
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $local_ip_cache : array<string|int, mixed>
- a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster
- checkResponseForErrors() : mixed
- Given the results of a getPage call, checks whether or not the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, then the $response string is sent to the crawlLog
- computePageHash() : string
- Computes a hash of a string containing page data for use in deduplication of pages with similar content
- getCurlIp() : string
- Computes the IP address from an HTTP get/response header
- getPage() : string
- Make a curl request for the provided url
- getPages() : array<string|int, mixed>
- Make multi_curl requests for an array of sites with urls or onion urls
- parseHeaderPage() : array<string|int, mixed>
- Splits an HTTP response document into the HTTP headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.
- prepareUrlHeaders() : array<string|int, mixed>
- Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookup, prepares the url, headers, dns resolve string, and referer needed to download the page
Properties
$local_ip_cache
a small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster
public
static array<string|int, mixed>
$local_ip_cache
= []
Methods
checkResponseForErrors()
Given the results of a getPage call, checks whether or not the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, then the $response string is sent to the crawlLog
public
static checkResponseForErrors(string $response) : mixed
Parameters
- $response : string
-
getPage response in which to check for errors
Return values
mixed —
computePageHash()
Computes a hash of a string containing page data for use in deduplication of pages with similar content
public
static computePageHash(string $page[, string $type = "not-text" ]) : string
Parameters
- $page : string
-
reference to web page data
- $type : string = "not-text"
-
for now, either text or not-text. If text, then some tags are removed before computing the hash
Return values
string —8-byte hash to identify page contents
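As an illustrative sketch (the seekquarry\yioop\library namespace path is an assumption, since the package name is not shown above), two pages with similar content could be compared for deduplication like so:

```php
<?php
// Namespace path assumed; adjust to where FetchUrl lives in your install
use seekquarry\yioop\library\FetchUrl;

$page_a = "<html><body><h1>Breaking News</h1><p>Story text</p></body></html>";
$page_b = "<html><body><h1>Breaking News</h1><p>Story text</p></body></html>";
// With $type == "text", some tags are removed before hashing, so pages with
// the same visible content should produce the same 8-byte hash
$hash_a = FetchUrl::computePageHash($page_a, "text");
$hash_b = FetchUrl::computePageHash($page_b, "text");
if (strcmp($hash_a, $hash_b) == 0) {
    echo "Pages appear to be duplicates\n";
}
```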
getCurlIp()
Computes the IP address from an HTTP get/response header
public
static getCurlIp(string $header) : string
Parameters
- $header : string
-
contains complete transcript of HTTP get/response
Return values
string —IPv4 address as a string of dot-separated quads.
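For example, the transcript returned by a getPage call could be scanned for the responding address (a sketch, assuming the downloaded string still contains the header portion of the transcript and the namespace path shown):

```php
<?php
use seekquarry\yioop\library\FetchUrl;

// $header_and_page holds the transcript of the HTTP request/response
$header_and_page = FetchUrl::getPage("https://www.example.com/");
$ip = FetchUrl::getCurlIp($header_and_page);
if ($ip) {
    echo "Server answered from $ip\n"; // dot-separated quads, e.g. 203.0.113.7
}
```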
getPage()
Make a curl request for the provided url
public
static getPage(string $site[, array<string|int, mixed> $post_data = null ][, bool $check_for_errors = false ][, string $user_password = null ][, $timeout = C\SINGLE_PAGE_TIMEOUT ]) : string
Parameters
- $site : string
-
url of page to request
- $post_data : array<string|int, mixed> = null
-
any data to be POST'd to the URL
- $check_for_errors : bool = false
-
whether or not to check the response for the words NOTICE, WARNING, FATAL, which might indicate an error on the server
- $user_password : string = null
-
username:password to use for connection if needed (optional)
- $timeout : = C\SINGLE_PAGE_TIMEOUT
-
how long to wait for page download to complete
Return values
string —the contents of what the curl request fetched
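A minimal usage sketch (the urls and POST fields below are made up for illustration; the namespace path is assumed):

```php
<?php
use seekquarry\yioop\library\FetchUrl;

// Simple GET request; the return value is whatever the curl request fetched
$contents = FetchUrl::getPage("https://www.example.com/robots.txt");

// POST request with error checking: if the response contains NOTICE, WARNING,
// or FATAL, checkResponseForErrors() sends the response to the crawlLog
$response = FetchUrl::getPage("https://www.example.com/search",
    ["query" => "yioop"], true);
```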
getPages()
Make multi_curl requests for an array of sites with urls or onion urls
public
static getPages(array<string|int, mixed> $sites[, bool $timer = false ][, int $page_range_request = C\PAGE_RANGE_REQUEST ][, string $temp_dir = "" ][, string $key = CrawlConstants::URL ][, string $value = CrawlConstants::PAGE ][, bool $minimal = false ][, array<string|int, mixed> $post_data = null ][, bool $follow = false ][, string $tor_proxy = "" ][, array<string|int, mixed> $proxy_servers = [] ]) : array<string|int, mixed>
Parameters
- $sites : array<string|int, mixed>
-
an array containing urls of pages to request
- $timer : bool = false
-
flag, true means print timing statistics to log
- $page_range_request : int = C\PAGE_RANGE_REQUEST
-
maximum number of bytes to download per page; 0 means download all
- $temp_dir : string = ""
-
folder to store temporary ip header info
- $key : string = CrawlConstants::URL
-
the component of $sites[$i] that has the value of a url to get; defaults to CrawlConstants::URL
- $value : string = CrawlConstants::PAGE
-
component of $sites[$i] in which to store the page that was gotten
- $minimal : bool = false
-
if true, do a faster request of pages by not doing things like extracting the HTTP headers sent, etc. Also will not use the Expect header
- $post_data : array<string|int, mixed> = null
-
data to be POST'd to each site
- $follow : bool = false
-
whether to follow redirects or not
- $tor_proxy : string = ""
-
url of a proxy that knows how to download .onion urls
- $proxy_servers : array<string|int, mixed> = []
-
if not [], then an array of proxy servers to use rather than downloading web pages directly from the current machine
Return values
array<string|int, mixed> —an updated array with the contents of those pages
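A sketch of a parallel download using the default CrawlConstants::URL and CrawlConstants::PAGE fields (namespace paths assumed):

```php
<?php
use seekquarry\yioop\library\CrawlConstants;
use seekquarry\yioop\library\FetchUrl;

$sites = [
    [CrawlConstants::URL => "https://www.example.com/"],
    [CrawlConstants::URL => "https://www.example.org/"],
];
// Fetch both pages with multi_curl; each entry of the returned array has its
// downloaded contents stored under CrawlConstants::PAGE
$sites = FetchUrl::getPages($sites);
foreach ($sites as $site) {
    $page = $site[CrawlConstants::PAGE] ?? "";
    echo $site[CrawlConstants::URL] . ": " . strlen($page) . " bytes\n";
}
```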
parseHeaderPage()
Splits an HTTP response document into the HTTP headers sent and the web page returned. Parses out useful information from the header and returns an array of these two parts and the useful info.
public
static parseHeaderPage(string $header_and_page[, string $value = CrawlConstants::PAGE ]) : array<string|int, mixed>
Parameters
- $header_and_page : string
-
string of downloaded data
- $value : string = CrawlConstants::PAGE
-
field in which to store the page portion of the response
Return values
array<string|int, mixed> —info array consisting of the header and page of an HTTP response, as well as the server, server version, operating system, encoding, and date information parsed from the header.
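A sketch of splitting a downloaded transcript into header information and page body; only the fields documented above are assumed, along with the namespace paths:

```php
<?php
use seekquarry\yioop\library\CrawlConstants;
use seekquarry\yioop\library\FetchUrl;

$header_and_page = FetchUrl::getPage("https://www.example.com/");
$site_info = FetchUrl::parseHeaderPage($header_and_page);
// The page body is stored under the $value field (CrawlConstants::PAGE by
// default); the rest of the array holds information parsed from the header,
// such as server, encoding, and date data
$page = $site_info[CrawlConstants::PAGE] ?? "";
```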
prepareUrlHeaders()
Given an extended url in the format url###ip_address###referer###Etag, a list of proxy servers, and a temporary directory for ip lookup.
public
static prepareUrlHeaders(string $url[, array<string|int, mixed> $proxy_servers = [] ][, string $temp_dir = "" ][, bool $expect_header = true ]) : array<string|int, mixed>
Returns a modified url, an array of headers for the Etag and to keep lighttpd happy, a dns resolve string used by curl for IP caching, and a separated-out referer string. If some of the fields in the extended url are not present, such as the ip, it attempts to compute them by looking in a cache in temp_dir. For other values, such as the referer, if not present, an empty string is returned, and for the Etag the header is just not added to the array of headers returned.
Parameters
- $url : string
-
site to download, with the ip address potentially at the end after ###
- $proxy_servers : array<string|int, mixed> = []
-
if not empty, an array of proxy servers to crawl through
- $temp_dir : string = ""
-
folder to store temporary ip header info
- $expect_header : bool = true
-
whether to send the Expect: 100-continue header
Return values
array<string|int, mixed> —5-tuple (orig url, url with replacement, http header array, colon-separated string with host:port:ip -- used for curl hashing, referer)
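A sketch of preparing an extended url for download; the ###-separated fields and the 5-tuple unpacking follow the description above, while the concrete values and namespace path are made-up assumptions:

```php
<?php
use seekquarry\yioop\library\FetchUrl;

// Extended url in the url###ip_address###referer###Etag format described above
$extended_url = "https://www.example.com/page.html###203.0.113.7###" .
    "https://www.example.com/###\"etag-value\"";
list($original_url, $url, $headers, $dns_resolve, $referer) =
    FetchUrl::prepareUrlHeaders($extended_url);
// $headers carries the Etag-related header (and headers to keep lighttpd happy),
// $dns_resolve is the host:port:ip string curl uses for IP caching, and
// $referer is the separated-out referer string
```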