SeekQuarry/Yioop -- Open Source Pure PHP Search Engine, Crawler, and Indexer
Copyright (C) 2009 - 2023 Chris Pollett chris@pollett.org
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
A library of string, error reporting, log, hash, time, and conversion functions
Interfaces, Classes, Traits and Enums
- Mod9Constants
- Mini-class (so not in its own file) used to hold encode/decode info related to Mod9 encoding (a variant of Simplified-9 specific to Yioop).
Table of Contents
- addRegexDelimiters() : string
- Adds delimiters to a regex that may or may not have them
- preg_search() : mixed
- Searches for a PCRE pattern in a subject from a given offset; returns the position of the first match if found, -1 otherwise.
- preg_offset_replace() : string
- Replaces a pcre pattern with a replacement in $subject starting from some offset.
- parse_ini_with_fallback() : array<string|int, mixed>
- Yioop replacement for parse_ini_file($name, true) in case parse_ini_file is on the disable_functions list. Name has underscores to match original function. This function checks whether parse_ini_file is disabled or not. If not, it just calls parse_ini_file; otherwise, it simulates it well enough that configure.ini files used for string translations can be read.
- getIniAssignMatch() : mixed
- Auxiliary function called from parse_ini_with_fallback to extract from the $matches array produced by the former function's preg_match what kind of assignment occurred in the ini file being parsed.
- charCopy() : mixed
- Copies from $source string beginning at position $start, $length many bytes to destination string
- vByteEncode() : string
- Encodes an integer using variable byte coding.
- vByteDecode() : int
- Decodes from a string using variable byte coding an integer.
- appendUnary() : mixed
- Appends a number re-encoded in unary to the end of an input string starting at a given bit offset into the string. Here n in unary has bit representation n-1 0's followed by a 1.
- decodeUnary() : int
- Decodes a unary number from an input string at a given bit offset. Here n in unary has bit representation n-1 0's followed by a 1.
- appendBits() : string
- Appends $num_bits bits from the start of the binary rep of $number beginning at offset $start_bit_offset of $input string overwriting any bits present. If $num_bits == -1, then appends all of $number.
- decodeBits() : int
- Decodes $num_bits many bits from the $input string beginning at offset $start_bit_offset. As a result of this operation, $start_bit_offset is increased by the number of bits that could be decoded.
- appendGamma() : string
- Appends gamma code of $number beginning at offset $start_bit_offset of $input string overwriting any bits present. $start_bit_offset is updated to bit position after append.
- decodeGammaList() : array<string|int, mixed>
- Decodes up to $num_decode gamma encoded integers beginning at $start_bit_offset. $start_bit_offset is updated to the bit position after the decoded integers.
- appendRiceSequence() : string
- Appends using a Rice coding a sequence of integers $int_sequence at offset $start_bit_offset to the string $output, overwriting any bits present at that location. $start_bit_offset is updated to bit position after append.
- decodeRiceSequence() : array<string|int, mixed>
- Decodes up to $num_decode rice encoded difference list of integers beginning at $start_bit_offset. $start_bit_offset is updated to the bit position after the decoded integers. If $delta_start >= 0 then the first int is assumed to be the difference from $delta_start;
- encodePositionList() : string
- Encodes a list of integer positions of a term in a document. This is done as a gamma code of the first integer followed by the Rice coding of the remaining integers using a modulus based on the average gap between integers. If the number of positions is 1 or 2 then a gamma of each position only is used.
- decodePositionList() : array<string|int, mixed>
- Decodes up to $num_decode term in document position integers from string $input under the assumption $input is encoded as per
- encode255() : string
- Recodes a string in a 1-1 fashion to a string not involving \xFF (255). I.e., it maps characters \xFE -> \xFE\xFD and \xFF -> \xFE\xFE
- decode255() : string
- Decodes a string in a 1-1 fashion from a string not involving \xFF (255). I.e., it maps characters \xFE\xFE -> \xFF and \xFE\xFD -> \xFE
- encodeUnderscore() : string
- Recodes a string in a 1-1 fashion to a string not involving underscore (_). I.e., it maps characters - -> -- and _ -> -=
- decodeUnderscore() : string
- Decodes a string in a 1-1 fashion from a string not involving underscore (_). I.e., it maps characters -= -> _ and -- -> -
- packEncode255() : string
- Encodes a list of strings as their @see encode255 versions separated by \xFF's
- unpackDecode255() : array<string|int, mixed>
- Decodes a list of strings from a string that encodes its elements using @see encode255, separated by \xFF's
- packPosting() : string
- Makes a packed integer string from a docindex and the number of occurrences of a word in the document with that docindex.
- unpackPosting() : array<string|int, mixed>
- Given a packed integer string, uses the top three bytes to calculate a doc_index of a document in the shard, and uses the low order byte to compute a number of occurrences of a word in that document.
- addDocIndexPostings() : string
- This method is used while appending one index shard to another.
- deltaList() : array<string|int, mixed>
- Computes the difference of a list of integers.
- deDeltaList() : array<string|int, mixed>
- Given an array of differences of integers reconstructs the original list. This computes the inverse of the deltaList function
- encodeModified9() : string
- Encodes a sequence of integers x, such that 1 <= x <= 2<<28-1 as a string. NOTICE x>=1.
- packListModified9() : string
- Packs the contents of a single word of a sequence being encoded using Modified9.
- nextPostString() : string
- Returns the next complete posting string from $input_string beginning at offset.
- decodeModified9() : array<string|int, mixed>
- Decodes a sequence of positive integers from a string that has been encoded using Modified 9
- unpackListModified9() : array<string|int, mixed>
- Decode a single word with high two bits off according to modified 9
- docIndexModified9() : int
- Given an int encoding a doc_index followed by a position list using Modified 9, extracts just the doc_index.
- unpackInt() : int
- Unpacks an int from a 4 char string
- packInt() : string
- Packs an int into a 4 char string
- unpackFloat() : float
- Unpacks a float from a 4 char string
- packFloat() : string
- Packs a float into a four char string
- renameSerializedObject() : string
- Used to change the namespace of a serialized php object (assumes doesn't have nested subobjects)
- getDomFromString() : DOMDocument
- Parses a provided string to make a DOM object. First tries to parse using XML and, if this fails, uses the more robust HTML DOM parser and manipulates the resulting DOM tree to make it correspond to the original tags for XML that isn't HTML
- getTags() : array<string|int, mixed>
- Returns an array of DOMDocuments for the nodes that match an xpath query on $dom, a DOMDocument
- toHexString() : string
- Converts a string to string where each char has been replaced by its hexadecimal equivalent
- toIntString() : string
- Converts a string to string where each char has been replaced by its integer equivalent
- toBinString() : string
- Converts a string to string where each char has been replaced by its binary equivalent
- metricToInt() : int
- Converts a string of the form some int followed by K, M, or G to the integer it represents.
- intToMetric() : string
- Converts a number to a string followed by nothing, K, M, G, T depending on whether number is < 1000, < 10^6, < 10^9, or < 10^(12)
- crawlLog() : mixed
- Logs a message to a logfile or the screen. The super-global field $_SERVER['LOG_TO_FILES'] determines if this will log to a file. If not, then in cli mode, will log to stdout, otherwise it will use error_log. When logging to file $_SERVER["NO_ROTATE_LOGS"] controls whether or not there will be a log file rotation. The first call to this method is typically used to set up a process to check for liveness. For example a call: crawlLog("\n\nInitialize logger..", $this->process_name, true); says $this->process_name should be checked for liveness as part of any subsequent logging activity such as a call crawlLog("Another Message"); (note subsequent calls don't need to specify the process name).
- makeTimestamp() : string
- Used to make a log file entry time string of format: entry number, time in r format.
- crawlTimeoutLog() : bool
- Writes a log message $msg if more than LOG_TIMEOUT time has passed since the last time crawlTimeoutLog was called. Useful in loops to write a message as progress is made through the loop (not on every iteration but, say, every 30 seconds).
- crawlHash() : string
- Computes an 8 byte hash of a string for use in storing documents.
- crawlHashWord() : string
- Used to create a 20 byte hash of a string (typically a word or phrase with a wikipedia page). Format is 8 byte crawlHash of term (md5 of term two halves XOR'd), followed by a \x00, followed by the first 11 characters from the term. If there are not enough chars to make 20 bytes, then the string is padded with \x00s to 20 bytes.
- canonicalTerm() : string
- Takes a $term that might have come from a document and converts it to a string of 16 bytes which is either the original term padded by underscores or the first seven chars of the term followed by an underscore followed by the base64 encoding of the first 6 chars of its md5 hash.
- compareWordHashes() : int
- Used to compare two ids for index dictionary lookup. ids are an 8 byte crawlHash together with a 12 byte non-hash suffix.
- base64Hash() : string
- Converts a crawl hash number to something closer to base64 coded, but in a form that doesn't get confused in urls or DBs
- unbase64Hash() : string
- Decodes a crawl hash number from base64 to raw ASCII
- webencode() : string
- Encodes a string in a format suitable for post data (mainly, base64, but str_replace data that might mess up post in result)
- webdecode() : string
- Decodes a string encoded by webencode
- crawlCrypt() : string
- The crawlHash function is used to encrypt passwords stored in the database.
- partitionByHash() : array<string|int, mixed>
- Used by a controller to take a table and return those rows in the table that a given queue_server would be responsible for handling
- calculatePartition() : int
- Used by a controller to say which queue_server should receive a given input
- changeInMicrotime() : float
- Measures the change in time in seconds between two timestamps to microsecond precision
- microTimestamp() : string
- Timestamp of current epoch with microsecond precision useful for situations where time() might cause too many collisions (account creation, etc)
- checkTimeInterval() : int
- Checks that a timestamp is within the time interval given by a start time (HH:mm) and a duration
- convertPixels() : int
- Converts a CSS unit string into its equivalent in pixels. This is used by @see SvgProcessor.
- countFiles() : int
- Returns the number of files in a folder
- makePath() : bool
- Creates folders along a filesystem path if they don't exist
- deleteFileOrDir() : mixed
- This is a callback function used in the process of recursively deleting a directory
- setWorldPermissions() : mixed
- This is a callback function used in the process of recursively chmoding to 777 all files in a folder
- fileInfo() : an
- This is a callback function used in the process of recursively calculating an array of file modification times and files sizes for a directory
- orderCallback() : int
- Callback function used to sort documents by a field
- stringOrderCallback() : int
- Callback function used to sort documents by a field where the field is assumed to be a string
- stringROrderCallback() : int
- Callback function used to sort documents by a field where the field is assumed to be a string
- rorderCallback() : int
- Callback function used to sort documents by a field in reverse order
- lessThan() : int
- Callback to check if $a is less than $b
- greaterThan() : int
- Callback to check if $a is greater than $b
- e() : mixed
- shorthand for echo
- remoteAddress() : mixed
- Compute the real remote address of the incoming connection including forwarding
- readInput() : string
- Used to read a line of input from the command-line
- readPassword() : string
- Used to read a line of input from the command-line (on unix machines without echoing it)
- readMessage() : string
- Used to read several lines from the terminal up until a last line consisting of just a "."
- mimeType() : string
- Returns the mime type of the provided file name if it can be determined.
- generalIsA() : bool
- Checks if class_1 is the same as class_2 or has class_2 as a parent. Behaves like the 3 param version (last param true) of the PHP is_a function that came into being with Version 5.3.9.
- stripAttributes() : string
- Given the contents of a start XML/HTML tag, strips out all the attributes not listed in $safe_attribute_list
- parseCsv() : array<string|int, mixed>
- Used to parse into a two dimensional array a string that contains CSV data.
- arraytoCsv() : string
- Converts an array of values to a comma separated value formatted string.
- diff() : string
- Computes a Unix-style diff of two strings. That is it only outputs lines which disagree between the two strings. It outputs +line if a line occurs in the second but not first string and -line if a line occurs in the first string but not the second.
- computeLCS() : mixed
- Computes the longest common subsequence of two arrays
- extractLCSFromTable() : mixed
- Extracts from a table of longest common sequence moves (probably calculated by @see computeLCS) and a starting coordinate $i, $j in that table, a longest common subsequence
- tail() : array<string|int, mixed>
- Returns an array of the last $num_lines many lines out of a file
- lineFilter() : array<string|int, mixed>
- Given an array of lines returns a subarray of those lines containing the filter string or filter array
- logLineTimestamp() : int
- Tries to extract a timestamp from a line which is presumed to come from a Yioop log file
- isPositiveInteger() : bool
- Returns whether an input can be parsed to a positive integer
- measureCall() : mixed
- Used to measure the memory footprint in bytes and time spent calling a method of an object. It also records the number of times the method has been called.
- measureObject() : mixed
- Used to measure the memory footprint of an object in Yioop and save it to a statistics file. No recording is done until an initial call to the function measureCall(null, save_statistics_file) where save_statistics_file is the name of the file you want to store statistics to.
- measureObjectCall() : mixed
- General method called by @see measureCall and @see measureObject. Used to measure the memory footprint in bytes of an object or the memory and time spent calling a method of an object. It also records the number of times the method has been called. When used to call a method before initialization, it just calls the method without any recording or timing. To initialize, make an initial call to the function measureCall(null, save_statistics_file), where save_statistics_file is the name of the file you want to store statistics to.
- variableClone() : mixed
- Makes a deep copy of a variable regardless of its type
- garbageCollect() : int
- Runs various system garbage collection functions and returns number of bytes freed.
- utf8SafeSaveHtml() : string
- The dom method saveHTML has a tendency to replace UTF-8, non-ascii characters with html entities. This function is supposed to save while avoiding that replacement.
- utf8WordWrap() : string
- A UTF-8 safe version of PHP's wordwrap function that wraps a string to a given number of characters
Adds delimiters to a regex that may or may not have them
addRegexDelimiters(string $expression) : string
- $expression : string
a regex
Return values
string —regex with delimiters added if they were not already there
Searches for a PCRE pattern in a subject from a given offset; returns the position of the first match if found, -1 otherwise.
preg_search(string $pattern, string $subject, int $offset[, bool $return_match = false ]) : mixed
- $pattern : string
a Perl compatible regular expression
- $subject : string
to search for pattern in
- $offset : int
character offset into $subject to begin searching from
- $return_match : bool = false
whether to return as well what the match was for the pattern
Return values
mixed —if $return_match is false then the integer position of first match, otherwise, it returns the ordered pair [$pos, $match].
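The behaviour described above can be approximated with PHP's built-in preg_match, which accepts a start offset and can report match positions. The following is only a sketch: the name sketchPregSearch is hypothetical, the value returned for the match text when nothing is found is an assumption, and preg_match counts offsets in bytes, which may differ from what Yioop's preg_search does for multi-byte text.

```php
<?php
/**
 * Hypothetical stand-in for the documented preg_search(); not Yioop's code.
 * Returns the byte position of the first match at or after $offset, or -1.
 */
function sketchPregSearch(string $pattern, string $subject, int $offset,
    bool $return_match = false)
{
    // PREG_OFFSET_CAPTURE turns each match into a pair [text, position]
    $found = preg_match($pattern, $subject, $matches,
        PREG_OFFSET_CAPTURE, $offset);
    if (!$found) {
        // covers both "no match" (0) and a pattern error (false)
        $pos = -1;
        $match = ""; // assumption: empty string when there is no match
    } else {
        [$match, $pos] = $matches[0];
    }
    return $return_match ? [$pos, $match] : $pos;
}
```

For instance, searching 'aabbaabbb' for '/b+/' from offset 4 skips the first run of b's and reports the second one.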
Replaces a pcre pattern with a replacement in $subject starting from some offset.
preg_offset_replace(string $pattern, string $replacement, string $subject, int $offset) : string
- $pattern : string
a Perl compatible regular expression
- $replacement : string
what to replace the pattern with
- $subject : string
to search for pattern in
- $offset : int
character offset into $subject to begin searching from
Return values
string —result of the replacements
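One simple way to get this effect is to leave the first $offset bytes alone and run preg_replace on the remainder. The sketch below does exactly that; sketchPregOffsetReplace is a hypothetical name and this is not necessarily how Yioop implements it. Note that splitting the string changes what anchors such as ^ match, and offsets here are bytes.

```php
<?php
/**
 * Hypothetical illustration of replacing only from $offset onward.
 */
function sketchPregOffsetReplace(string $pattern, string $replacement,
    string $subject, int $offset): string
{
    $head = substr($subject, 0, $offset); // untouched prefix
    $tail = substr($subject, $offset);    // part the pattern is applied to
    // preg_replace returns null on a pattern error; keep the tail unchanged then
    $replaced = preg_replace($pattern, $replacement, $tail) ?? $tail;
    return $head . $replaced;
}
```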
Yioop replacement for parse_ini_file($name, true) in case parse_ini_file is on the disable_functions list. Name has underscores to match original function. This function checks whether parse_ini_file is disabled or not. If not, it just calls parse_ini_file; otherwise, it simulates it well enough that configure.ini files used for string translations can be read.
parse_ini_with_fallback(string $file) : array<string|int, mixed>
- $file : string
filename of ini data to parse into an array
Return values
array<string|int, mixed> —data parsed from file
Auxiliary function called from parse_ini_with_fallback to extract from the $matches array produced by the former function's preg_match what kind of assignment occurred in the ini file being parsed.
getIniAssignMatch(string $matches) : mixed
- $matches : string
produced by a preg_match in parse_ini_with_fallback
Return values
mixed —value of ini file assignment
Copies from $source string beginning at position $start, $length many bytes to destination string
charCopy(string $source, string &$destination, int $start, int $length[, string $timeout_msg = "" ]) : mixed
- $source : string
string to copy from
- $destination : string
string to copy to
- $start : int
starting offset
- $length : int
number of bytes to copy
- $timeout_msg : string = ""
message to print if taking more than 30 seconds
Return values
mixed
Encodes an integer using variable byte coding.
vByteEncode(int $pos_int) : string
- $pos_int : int
integer to encode
Return values
string —a string of 1-5 chars depending on how big $pos_int was
Decodes from a string using variable byte coding an integer.
vByteDecode(string $str, int &$offset) : int
- $str : string
string to use for decoding
- $offset : int
byte offset into string where the var int is stored
Return values
int —the decoded integer
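Variable byte coding stores an integer in as few bytes as it needs, which is why the encoder above returns between 1 and 5 chars for 32-bit values. The sketch below uses one common layout, assumed here rather than taken from Yioop's source: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are hypothetical.

```php
<?php
/** Hypothetical variable byte encoder for a non-negative integer. */
function sketchVByteEncode(int $pos_int): string
{
    $out = "";
    while ($pos_int >= 128) {
        // low 7 bits, with the high bit set to say "more bytes follow"
        $out .= chr(($pos_int & 0x7F) | 0x80);
        $pos_int >>= 7;
    }
    return $out . chr($pos_int); // high bit clear marks the final byte
}

/** Hypothetical decoder; advances $offset past the bytes it consumed. */
function sketchVByteDecode(string $str, int &$offset): int
{
    $value = 0;
    $shift = 0;
    $len = strlen($str);
    while ($offset < $len) {
        $byte = ord($str[$offset++]);
        $value |= ($byte & 0x7F) << $shift;
        if ($byte < 128) {
            break; // last byte of this integer
        }
        $shift += 7;
    }
    return $value;
}
```

Because $offset is passed by reference, several integers stored back to back can be read with repeated calls.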
Appends a number re-encoded in unary to the end of an input string starting at a given bit offset into the string. Here n in unary has bit representation n-1 0's followed by a 1.
appendUnary(int $number, mixed $input, mixed &$start_bit_offset[, mixed $just_bit_offset = false ]) : mixed
- $number : int
number to append
- $input : mixed
- $start_bit_offset : mixed
- $just_bit_offset : mixed = false
Return values
mixed —either the resulting string or its length
Decodes a unary number from an input string at a given bit offset. Here n in unary has bit representation n-1 0's followed by a 1.
decodeUnary(string $input, int &$start_bit_offset) : int
- $input : string
the string that we want to decode a unary number from
- $start_bit_offset : int
the starting bit offset in $input to start decoding from. After the call it will be the position after the decode
Return values
int —the decoded unary number
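To make the unary convention concrete (n is written as n-1 0's followed by a 1), the sketch below works on strings of the characters '0' and '1' instead of packed bits. That representation, the function names, and the 0 returned when no terminating 1 is found are all assumptions made for readability; the documented functions operate on real bit offsets.

```php
<?php
/** Hypothetical: unary code of $n >= 1 as a string of '0'/'1' characters. */
function sketchUnary(int $n): string
{
    return str_repeat("0", $n - 1) . "1";
}

/** Hypothetical: reads one unary number starting at $offset and advances it. */
function sketchDecodeUnary(string $bits, int &$offset): int
{
    if ($offset >= strlen($bits)) {
        return 0; // assumption: 0 signals that nothing could be decoded
    }
    $one_at = strpos($bits, "1", $offset);
    if ($one_at === false) {
        return 0;
    }
    $n = $one_at - $offset + 1; // zeros skipped, plus the terminating 1
    $offset = $one_at + 1;
    return $n;
}
```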
Appends $num_bits bits from the start of the binary rep of $number beginning at offset $start_bit_offset of $input string overwriting any bits present. If $num_bits == -1, then appends all of $number.
appendBits(int $number, string $input, int &$start_bit_offset[, $num_bits = -1 ]) : string
- $number : int
to append
- $input : string
the string to append to.
- $start_bit_offset : int
starting location to begin append from
- $num_bits : = -1
number of bits of $number to append.
Return values
string —resulting string
Decodes $num_bits many bits from the $input string beginning at offset $start_bit_offset. As a result of this operation, $start_bit_offset is increased by the number of bits that could be decoded.
decodeBits(string $input, int &$start_bit_offset, int $num_bits) : int
- $input : string
string to decode bits from
- $start_bit_offset : int
bit offset to start decoding from in $input
- $num_bits : int
number of bits to try to decode
Return values
int —the number decoded
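Reading and writing at a bit offset inside a byte string comes down to locating the byte (offset divided by 8) and the bit within it (offset modulo 8). The sketch below assumes bits are numbered from the most significant bit of each byte and that the low $num_bits bits of the number are written most significant first; neither choice is confirmed by this page, the names are hypothetical, and the $num_bits == -1 case is not handled.

```php
<?php
/** Hypothetical: overwrite $num_bits bits of $input starting at the bit offset. */
function sketchWriteBits(int $number, string $input, int &$start_bit_offset,
    int $num_bits): string
{
    for ($i = $num_bits - 1; $i >= 0; $i--) {
        $index = $start_bit_offset >> 3;            // which byte
        while ($index >= strlen($input)) {
            $input .= "\x00";                       // grow the string as needed
        }
        $mask = 1 << (7 - ($start_bit_offset & 7)); // which bit in that byte
        $byte = ord($input[$index]);
        $bit = ($number >> $i) & 1;
        $byte = $bit ? ($byte | $mask) : ($byte & ~$mask);
        $input[$index] = chr($byte);
        $start_bit_offset++;
    }
    return $input;
}

/** Hypothetical: read up to $num_bits bits, advancing the offset per bit read. */
function sketchReadBits(string $input, int &$start_bit_offset,
    int $num_bits): int
{
    $value = 0;
    $total_bits = 8 * strlen($input);
    while ($num_bits > 0 && $start_bit_offset < $total_bits) {
        $byte = ord($input[$start_bit_offset >> 3]);
        $bit = ($byte >> (7 - ($start_bit_offset & 7))) & 1;
        $value = ($value << 1) | $bit;
        $start_bit_offset++;
        $num_bits--;
    }
    return $value;
}
```

As in the documented functions, the offset is passed by reference so consecutive calls continue where the previous one stopped.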
Appends gamma code of $number beginning at offset $start_bit_offset of $input string overwriting any bits present. $start_bit_offset is updated to bit position after append.
appendGamma(int $number, string $input, int &$start_bit_offset) : string
- $number : int
to append
- $input : string
the string to append to.
- $start_bit_offset : int
starting bit location to begin append from
Return values
string —resulting string
Decodes up to $num_decode gamma encoded integers beginning at $start_bit_offset. $start_bit_offset is updated to the bit position after the decoded integers.
decodeGammaList(string $input, int &$start_bit_offset, int $num_decode) : array<string|int, mixed>
- $input : string
the string to decode from
- $start_bit_offset : int
starting bit location to decode from
- $num_decode : int
number of int's to decode
Return values
array<string|int, mixed> —decoded int's
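As a reminder of what a gamma code looks like, the sketch below implements the standard Elias gamma code: a number with k binary digits is written as k-1 0's followed by those k digits. It works on strings of '0' and '1' characters for readability, and the function names are hypothetical. Whether Yioop lays the bits out in exactly this order is an assumption.

```php
<?php
/** Hypothetical: Elias gamma code of $number >= 1 as '0'/'1' characters. */
function sketchGamma(int $number): string
{
    $binary = decbin($number); // starts with '1' whenever $number >= 1
    return str_repeat("0", strlen($binary) - 1) . $binary;
}

/** Hypothetical: decodes up to $num_decode gamma codes, advancing $offset. */
function sketchDecodeGammaList(string $bits, int &$offset,
    int $num_decode): array
{
    $out = [];
    $len = strlen($bits);
    while (count($out) < $num_decode && $offset < $len) {
        $one_at = strpos($bits, "1", $offset);
        if ($one_at === false) {
            break; // only zeros left, nothing more to decode
        }
        $width = $one_at - $offset + 1; // number of binary digits
        if ($one_at + $width > $len) {
            break; // truncated code
        }
        $out[] = bindec(substr($bits, $one_at, $width));
        $offset = $one_at + $width;
    }
    return $out;
}
```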
Appends using a Rice coding a sequence of integers $int_sequence at offset $start_bit_offset to the string $output, overwriting any bits present at that location. $start_bit_offset is updated to bit position after append.
appendRiceSequence(array<string|int, mixed> $int_sequence, int $modulus, string $output, int &$start_bit_offset[, int $delta_start = -1 ]) : string
Encoding is done as a difference list. If $delta_start is set to a value >= 0 then the first gap is assumed to be from the int $delta_start
- $int_sequence : array<string|int, mixed>
int's to append
- $modulus : int
i in the 2^i modulus to use for Rice code
- $output : string
the string to append to.
- $start_bit_offset : int
starting bit location to begin append from
- $delta_start : int = -1
if >= 0 previous int to use for difference list otherwise the first integer is encoded as itself rather than a difference
Return values
string —resulting string
Decodes up to $num_decode Rice encoded integers of a difference list beginning at $start_bit_offset. $start_bit_offset is updated to the bit position after the decoded integers. If $delta_start >= 0 then the first int is assumed to be the difference from $delta_start.
decodeRiceSequence(string $input, int &$start_bit_offset, int $num_decode[, int $delta_start = -1 ]) : array<string|int, mixed>
- $input : string
the string to decode from
- $start_bit_offset : int
starting bit location to decode from
- $num_decode : int
number of int's to decode
- $delta_start : int = -1
if >= 0 previous int to use for difference list otherwise the first integer is decoded as itself rather than a difference
Return values
array<string|int, mixed> —decoded int's
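The idea behind Rice coding a difference list can be sketched as follows: each gap is split into a quotient and a remainder with respect to 2^i, the quotient is written in unary and the remainder in i plain bits. The sketch uses '0'/'1' character strings and hypothetical function names. Two further assumptions: the quotient q is written as q 0's followed by a 1, and the decoder is handed the modulus explicitly, whereas the documented decodeRiceSequence has no such parameter.

```php
<?php
/** Hypothetical: Rice codes the gaps of a nondecreasing integer sequence. */
function sketchRiceEncode(array $int_sequence, int $modulus,
    int $delta_start = -1): string
{
    $bits = "";
    // with no $delta_start the first integer is encoded as itself
    $previous = ($delta_start >= 0) ? $delta_start : 0;
    foreach ($int_sequence as $value) {
        $gap = $value - $previous;
        $previous = $value;
        $quotient = $gap >> $modulus;
        $remainder = $gap & ((1 << $modulus) - 1);
        $bits .= str_repeat("0", $quotient) . "1"; // quotient in unary
        if ($modulus > 0) {
            $bits .= str_pad(decbin($remainder), $modulus, "0", STR_PAD_LEFT);
        }
    }
    return $bits;
}

/** Hypothetical: inverse of sketchRiceEncode, advancing $offset as it reads. */
function sketchRiceDecode(string $bits, int &$offset, int $num_decode,
    int $modulus, int $delta_start = -1): array
{
    $out = [];
    $previous = ($delta_start >= 0) ? $delta_start : 0;
    $len = strlen($bits);
    while (count($out) < $num_decode && $offset < $len) {
        $one_at = strpos($bits, "1", $offset);
        if ($one_at === false || $one_at + 1 + $modulus > $len) {
            break; // incomplete code
        }
        $quotient = $one_at - $offset;
        $remainder = ($modulus > 0)
            ? bindec(substr($bits, $one_at + 1, $modulus)) : 0;
        $previous += ($quotient << $modulus) + $remainder;
        $out[] = $previous;
        $offset = $one_at + 1 + $modulus;
    }
    return $out;
}
```

A modulus close to the base-2 logarithm of the typical gap keeps the unary part short, which is presumably why the position list encoder described below derives it from the average gap.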
Encodes a list of integer positions of a term in a document. This is done as a gamma code of the first integer followed by the Rice coding of the remaining integers using a modulus based on the average gap between integers. If the number of positions is 1 or 2 then a gamma of each position only is used.
encodePositionList(array<string|int, mixed> $positions) : string
- $positions : array<string|int, mixed>
integer term positions
Return values
string —encoded position list
Decodes up to $num_decode term in document position integers from string $input under the assumption $input is encoded as per
decodePositionList(string $input, int $num_decode) : array<string|int, mixed>
- $input : string
string to decode from
- $num_decode : int
number of integers to decode
Return values
array<string|int, mixed> —decoded positions
Recodes a string in a 1-1 fashion to a string not involving \xFF (255). I.e., it maps characters \xFE -> \xFE\xFD and \xFF -> \xFE\xFE
encode255(string $str) : string
- $str : string
to be encoded
Return values
string —encoded string without \xFF
Decodes a string in a 1-1 fashion from a string not involving \xFF (255). I.e., it maps characters \xFE\xFE -> \xFF and \xFE\xFD -> \xFE
decode255(string $str) : string
- $str : string
to be decoded
Return values
string —decoded string
Recodes a string in a 1-1 fashion to a string not involving underscore (_). I.e., it maps characters - -> -- and _ -> -=
encodeUnderscore(string $str) : string
- $str : string
to be encoded
Return values
string —encoded string without _
Decodes a string in a 1-1 fashion from a string not involving underscore (_). I.e., it maps characters -= -> _ and -- -> -
decodeUnderscore(string $str) : string
- $str : string
to be decoded
Return values
string —decoded string
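The underscore mapping is an escape scheme with - as the escape character: - becomes -- and _ becomes -=, so the output never contains an underscore and can be decoded unambiguously. A minimal sketch using PHP's strtr follows; the function names are hypothetical and this is not claimed to be Yioop's implementation. strtr with an array scans left to right and never rescans its own output, which keeps the two-character codes aligned.

```php
<?php
/** Hypothetical: escapes '-' and '_' so the result contains no underscore. */
function sketchEncodeUnderscore(string $str): string
{
    return strtr($str, ["-" => "--", "_" => "-="]);
}

/** Hypothetical: inverse of sketchEncodeUnderscore. */
function sketchDecodeUnderscore(string $str): string
{
    return strtr($str, ["--" => "-", "-=" => "_"]);
}
```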
Encodes a list of strings as their @see encode255 versions separated by \xFF's
packEncode255(array<string|int, mixed> $strs) : string
- $strs : array<string|int, mixed>
strings to encode as a single string
Return values
string —encoded list
Decodes a list of strings from a string that encodes its elements using @see encode255, separated by \xFF's
unpackDecode255(string $encoded_strs) : array<string|int, mixed>
- $encoded_strs : string
string to decode into a list of strings
Return values
array<string|int, mixed> —decoded list
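Because an encode255 string never contains \xFF, that byte is free to act as a separator between list elements. The sketch below shows the escaping and the packing together, assuming the mappings \xFE -> \xFE\xFD and \xFF -> \xFE\xFE; all function names are hypothetical. One consequence of this simple layout, at least in the sketch, is that an empty list and a list holding one empty string pack to the same empty string.

```php
<?php
/** Hypothetical: escape so that the result contains no \xFF byte. */
function sketchEncode255(string $str): string
{
    return strtr($str, ["\xFE" => "\xFE\xFD", "\xFF" => "\xFE\xFE"]);
}

/** Hypothetical: inverse of sketchEncode255. */
function sketchDecode255(string $str): string
{
    return strtr($str, ["\xFE\xFE" => "\xFF", "\xFE\xFD" => "\xFE"]);
}

/** Hypothetical: joins the escaped strings with \xFF separators. */
function sketchPackEncode255(array $strs): string
{
    return implode("\xFF", array_map('sketchEncode255', $strs));
}

/** Hypothetical: splits on \xFF and unescapes each piece. */
function sketchUnpackDecode255(string $encoded_strs): array
{
    return array_map('sketchDecode255', explode("\xFF", $encoded_strs));
}
```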
Makes a packed integer string from a docindex and the number of occurrences of a word in the document with that docindex.
packPosting(int $doc_index, array<string|int, mixed> $position_list[, bool $delta = true ]) : string
- $doc_index : int
index (i.e., a count of which document it is rather than a byte offset) of a document in the document string
- $position_list : array<string|int, mixed>
integer positions word occurred in that doc
- $delta : bool = true
if true then stores the position_list as a sequence of differences (a delta list)
Return values
string —a modified9 (our compression scheme) packed string containing this info.
Given a packed integer string, uses the top three bytes to calculate a doc_index of a document in the shard, and uses the low order byte to compute a number of occurrences of a word in that document.
unpackPosting(string $posting, int &$offset[, bool $dedelta = true ]) : array<string|int, mixed>
- $posting : string
a string containing a doc index position list pair encoded using modified9
- $offset : int
an offset into the string where the modified9 posting is encoded
- $dedelta : bool = true
if true then assumes the list is a sequence of differences (a delta list) and undoes the difference to get the original sequence
Return values
array<string|int, mixed> —consisting of integer doc_index and a subarray consisting of integer positions of word in doc.
This method is used while appending one index shard to another.
addDocIndexPostings(string &$postings, int $add_offset) : string
Given a string of postings, adds $add_offset to each offset into the document map in each posting.
- $postings : string
a string of index shard postings
- $add_offset : int
a fixed amount to add to each posting's doc map offset
Return values
string —$new_postings where each doc offset has had $add_offset added to it
Computes the difference of a list of integers.
deltaList(array<string|int, mixed> $list) : array<string|int, mixed>
i.e., (a1, a2, a3, a4) becomes (a1, a2-a1, a3-a2, a4-a3)
- $list : array<string|int, mixed>
a nondecreasing list of integers
Return values
array<string|int, mixed> —the corresponding list of differences of adjacent integers
Given an array of differences of integers reconstructs the original list. This computes the inverse of the deltaList function
deDeltaList(array<string|int, mixed> &$delta_list) : array<string|int, mixed>
- $delta_list : array<string|int, mixed>
a list of nonnegative integers
Return values
array<string|int, mixed> —a nondecreasing list of integers
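The two list transformations are inverses of each other: taking differences of adjacent entries, then taking running sums. A minimal sketch under hypothetical names, not Yioop's code:

```php
<?php
/** Hypothetical: (a1, a2, a3) becomes (a1, a2-a1, a3-a2). */
function sketchDeltaList(array $list): array
{
    $deltas = [];
    $previous = 0;
    foreach ($list as $value) {
        $deltas[] = $value - $previous;
        $previous = $value;
    }
    return $deltas;
}

/** Hypothetical: rebuilds the original list as running sums of the deltas. */
function sketchDeDeltaList(array $delta_list): array
{
    $list = [];
    $sum = 0;
    foreach ($delta_list as $delta) {
        $sum += $delta;
        $list[] = $sum;
    }
    return $list;
}
```

For a nondecreasing input every difference is nonnegative, and sorted position lists tend to have small gaps, which is what makes the bit-level codes on this page effective.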
Encodes a sequence of integers x, such that 1 <= x <= 2<<28-1 as a string. NOTICE x>=1.
encodeModified9(array<string|int, mixed> $list) : string
The encoded string is a sequence of 4 byte words (packed int's). The high order 2 bits of a given word indicate whether or not to look at the next word. The codes are as follows: 11 start of encoded string, 10 continue four more bytes, 01 end of encoded, and 00 indicates whole sequence encoded in one word.
After the high order 2 bits, the next most significant bits indicate the format of the current word. There are nine possibilities: 00 - 1 28 bit number, 01 - 2 14 bit numbers, 10 - 3 9 bit numbers, 1100 - 4 6 bit numbers, 1101 - 5 5 bit numbers, 1110 - 6 4 bit numbers, 11110 - 7 3 bit numbers, 111110 - 12 2 bit numbers, 111111 - 24 1 bit numbers.
- $list : array<string|int, mixed>
a list of positive integers satisfying the above
Return values
string —encoded string
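The nine word formats listed above suggest a greedy strategy: for each 4 byte word, pick the format that holds the most of the upcoming integers. The sketch below only plans which (count, bit width) layout each word would use; it does not emit the continue bits or pack anything. The greedy rule, the decision to skip layouts needing more numbers than remain, the assumption that a b-bit slot holds values below 2^b, and all names are illustrative rather than taken from Yioop.

```php
<?php
// [how many numbers fit in a word, bits per number], densest layout first
const SKETCH_MOD9_LAYOUTS = [[24, 1], [12, 2], [7, 3], [6, 4], [5, 5],
    [4, 6], [3, 9], [2, 14], [1, 28]];

/** Hypothetical: densest layout whose slots fit the next run of values. */
function sketchChooseLayout(array $list, int $start): ?array
{
    $remaining = count($list) - $start;
    foreach (SKETCH_MOD9_LAYOUTS as [$count, $bits]) {
        if ($count > $remaining) {
            continue; // assumption: never pad a word with missing values
        }
        $limit = 1 << $bits;
        $fits = true;
        for ($i = $start; $i < $start + $count; $i++) {
            if ($list[$i] < 1 || $list[$i] >= $limit) {
                $fits = false;
                break;
            }
        }
        if ($fits) {
            return [$count, $bits];
        }
    }
    return null; // a value was outside the representable range
}

/** Hypothetical: layout chosen for each successive word of the encoding. */
function sketchPlanWords(array $list): array
{
    $plan = [];
    $start = 0;
    while ($start < count($list)) {
        $layout = sketchChooseLayout($list, $start);
        if ($layout === null) {
            throw new InvalidArgumentException(
                "value out of range at index $start");
        }
        $plan[] = $layout;
        $start += $layout[0];
    }
    return $plan;
}
```

For the list 1, 1, 1, 1, 1, 1, 1, 300, 5 this plans one word of seven 3 bit numbers followed by one word of two 14 bit numbers.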
Packs the contents of a single word of a sequence being encoded using Modified9.
packListModified9(int $continue_bits, int $cnt, array<string|int, mixed> $pack_list) : string
- $continue_bits : int
the high order 2 bits of the word
- $cnt : int
the number of elements that will be packed in this word
- $pack_list : array<string|int, mixed>
a list of positive integers to pack into word
Return values
string —encoded 4 byte string
Returns the next complete posting string from $input_string beginning at offset.
nextPostString(string &$input_string, int &$offset) : string
Does not do any decoding.
- $input_string : string
a string of postings
- $offset : int
an offset to this string which will be updated after call
Return values
string —undecoded posting
Decodes a sequence of positive integers from a string that has been encoded using Modified 9
decodeModified9(string $input_string, int &$offset) : array<string|int, mixed>
- $input_string : string
string to decode from
- $offset : int
where to start in the string; after the decode, points to the position just after what was decoded.
Return values
array<string|int, mixed> —sequence of positive integers that were decoded
Decode a single word with high two bits off according to modified 9
unpackListModified9(string $encoded_list) : array<string|int, mixed>
- $encoded_list : string
four byte string to decode
Return values
array<string|int, mixed> —sequence of integers that results from the decoding.
Given an int encoding a doc_index followed by a position list using Modified 9, extracts just the doc_index.
docIndexModified9(int $encoded_list) : int
- $encoded_list : int
in the just described format
Return values
int —a doc index into an index shard document map.
Unpacks an int from a 4 char string
unpackInt(string $str) : int
- $str : string
where to extract int from
Return values
int —extracted integer
Packs an int into a 4 char string
packInt(int $my_int) : string
- $my_int : int
the integer to pack
Return values
string —the packed string
Unpacks a float from a 4 char string
unpackFloat(string $str) : float
- $str : string
where to extract float from
Return values
float —extracted float
Packs a float into a four char string
packFloat(float $my_float) : string
- $my_float : float
the float to pack
Return values
string —the packed string
Used to change the namespace of a serialized PHP object (assumes the object doesn't have nested subobjects)
renameSerializedObject(string $class_name, string $object_string) : string
- $class_name : string
new fully qualified name with namespace
- $object_string : string
serialized object
Return values
string —serialized object with new name
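As an illustration only, the renaming can be sketched by rewriting the class header at the start of the serialized string (PHP serializes an object as O:length:"ClassName":...). The function name sketchRenameSerializedObject is hypothetical, and this is not necessarily how Yioop does it.

```php
<?php
// Hypothetical sketch: swap the class name in the leading O:<len>:"<name>"
// header of a serialized object. Nested objects are left untouched, which
// matches the stated assumption of no nested subobjects.
function sketchRenameSerializedObject(string $class_name,
    string $object_string): string
{
    if (!preg_match('/^O:\d+:"[^"]+"/', $object_string, $matches)) {
        return $object_string; // not a serialized object; leave as is
    }
    $header = 'O:' . strlen($class_name) . ':"' . $class_name . '"';
    return $header . substr($object_string, strlen($matches[0]));
}

$object = new stdClass();
$object->a = 1;
echo sketchRenameSerializedObject('my\ns\Thing', serialize($object)), "\n";
```

The length prefix has to be recomputed from the new name, otherwise unserialize would reject the string.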
Parses a provided string to make a DOM object. First tries to parse using XML and, if this fails, uses the more robust HTML DOM parser and manipulates the resulting DOM tree to make it correspond to the original tags for XML that isn't HTML
getDomFromString(string $to_parse) : DOMDocument
- $to_parse : string
the string to parse a DOMDocument from
Return values
DOMDocument —computed based on the provided string
Returns an array of DOMDocuments for the nodes that match an xpath query on $dom, a DOMDocument
getTags(DOMDocument $dom, string $query) : array<string|int, mixed>
- $dom : DOMDocument
document to run xpath query on
- $query : string
xpath query to run
Return values
array<string|int, mixed> —of DOMDocuments one for each node matching the xpath query in the original DOMDocument
Converts a string to a string where each char has been replaced by its hexadecimal equivalent
toHexString(string $str) : string
- $str : string
what we want rewritten in hex
Return values
string —the hexified string
Converts a string to a string where each char has been replaced by its integer equivalent
toIntString(string $str) : string
- $str : string
what we want rewritten as integers
Return values
string —the integer string
Converts a string to a string where each char has been replaced by its binary equivalent
toBinString(string $str) : string
- $str : string
what we want rewritten in binary
Return values
string —the binary string
Converts a string consisting of an int followed by K, M, or G into its integer equivalent.
metricToInt(string $metric_num) : int
For example, 4K would become 4000, 16M would become 16000000, and 1G would become 1000000000. Note that K, M, and G are powers of 10 here, not base 2.
- $metric_num : string
metric number to convert
Return values
int —number the metric string corresponded to
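The conversion above can be sketched as a suffix lookup. sketchMetricToInt is a hypothetical name, and treating a suffix-free string as a plain integer is an assumption.

```php
<?php
// Hypothetical sketch of the K/M/G conversion using powers of 10.
function sketchMetricToInt(string $metric_num): int
{
    $multipliers = ["K" => 1000, "M" => 1000000, "G" => 1000000000];
    $metric_num = trim($metric_num);
    $suffix = strtoupper(substr($metric_num, -1));
    if (isset($multipliers[$suffix])) {
        return intval(substr($metric_num, 0, -1)) * $multipliers[$suffix];
    }
    // assumption: no recognized suffix means the string is already an int
    return intval($metric_num);
}

foreach (["4K", "16M", "1G", "250"] as $example) {
    echo $example, " => ", sketchMetricToInt($example), "\n";
}
```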
Converts a number to a string followed by nothing, K, M, G, T depending on whether number is < 1000, < 10^6, < 10^9, or < 10^(12)
intToMetric(int $num) : string
- $num : int
number to convert
Return values
string —the metric string corresponding to the number
Logs a message to a logfile or the screen. The super-global field $_SERVER['LOG_TO_FILES'] determines if this will log to a file. If not, then in cli mode it will log to stdout; otherwise it will use error_log. When logging to a file, $_SERVER["NO_ROTATE_LOGS"] controls whether or not there will be a log file rotation. The first call to this method is typically used to set up a process to check for liveness. For example, the call crawlLog("\n\nInitialize logger..", $this->process_name, true); says $this->process_name should be checked for liveness as part of any subsequent logging activity, such as the call crawlLog("Another Message"); (note that subsequent calls don't need to specify the process name).
crawlLog(string $msg[, string $lname = null ][, bool $check_process_handler = false ]) : mixed
- $msg : string
message to log. If empty then no message written
- $lname : string = null
name of log file in the LOG_DIR directory. Rotated logs also use this as their basename, followed by a number, and are gzipped. (Older versions of Yioop used bzip. Some distros don't have bzip but do have gzip, and gzip was already being used elsewhere in Yioop, so bzip was replaced to remove the dependency.)
- $check_process_handler : bool = false
by default set to false. Once a first call has set this to true, subsequent calls that leave it false will call processHandler to check how long the code has run since processHandler was last called.
Return values
mixed
Used to make a log file entry time string of format: entry number, time in r format.
makeTimestamp([int $time = -1 ]) : string
- $time : int = -1
a unix timestamp
Return values
string —[line_count_in_log r_formatted_date]
Writes a log message $msg if more than LOG_TIMEOUT time has passed since the last time crawlTimeoutLog was called. Useful in loops to write a message as progress is made through the loop (not on every iteration but, say, every 30 seconds).
crawlTimeoutLog(mixed $msg) : bool
- $msg : mixed
usually a string with what is to be printed out after the timeout period. If $msg === true, the timeout cache is cleared
Return values
bool —whether a log message was written
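The throttling idea can be sketched with a static variable that remembers when a message was last written. This is a simplified illustration, not Yioop's code: sketchTimeoutLog, its $timeout and $now parameters, the use of echo, and the choice that the first call only starts the timer are all assumptions.

```php
<?php
// Hypothetical sketch of time-throttled logging. $now can be injected so
// the behavior is easy to follow; when omitted, time() is used.
function sketchTimeoutLog($msg, int $timeout = 30, ?int $now = null): bool
{
    static $last_time = null;
    $now = $now ?? time();
    if ($msg === true) {
        $last_time = null; // clear the remembered time
        return false;
    }
    if ($last_time === null) {
        $last_time = $now; // assumption: first call only starts the timer
        return false;
    }
    if ($now - $last_time <= $timeout) {
        return false; // not enough time has passed yet
    }
    echo $msg, "\n";
    $last_time = $now;
    return true;
}

// Simulated loop: only iterations more than 30 seconds apart write a line.
$written = 0;
foreach ([0, 10, 31, 40, 65] as $fake_now) {
    if (sketchTimeoutLog("still working...", 30, $fake_now)) {
        $written++;
    }
}
```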
Computes an 8 byte hash of a string for use in storing documents.
crawlHash(string $string[, bool $raw = false ]) : string
An eight byte hash was chosen so that the odds of collision even for a few billion documents via the birthday problem are still reasonable. If the raw flag is set to false then an 11 byte base64 encoding of the 8 byte hash is returned. The hash is calculated as the xor of the two halves of the 16 byte md5 of the string. (8 bytes takes less storage which is useful for keeping more doc info in memory)
- $string : string
the string to hash
- $raw : bool = false
whether to leave raw or base 64 encode
Return values
string —the hash of $string
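The calculation described above (xor of the two halves of the 16 byte md5, optionally base64 encoded to 11 bytes) can be sketched as follows. sketchCrawlHash is a hypothetical name, and the URL-safe base64 alphabet with padding stripped is an assumption about how the 11 byte form is produced.

```php
<?php
// Hypothetical sketch: 8 byte hash as the xor of the two md5 halves.
function sketchCrawlHash(string $string, bool $raw = false): string
{
    $md5 = md5($string, true); // 16 raw bytes
    // the ^ operator on two strings xors them byte by byte
    $hash = substr($md5, 0, 8) ^ substr($md5, 8, 8);
    if ($raw) {
        return $hash;
    }
    // assumption: URL-safe base64 with the trailing "=" removed, which
    // turns the 12 char base64 of 8 bytes into 11 chars
    return rtrim(strtr(base64_encode($hash), "+/", "-_"), "=");
}

echo strlen(sketchCrawlHash("https://www.example.com/", true)), " ",
    strlen(sketchCrawlHash("https://www.example.com/")), "\n";
```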
Used to create a 20 byte hash of a string (typically a word or phrase with a Wikipedia page). The format is the 8 byte crawlHash of the term (the two halves of the md5 of the term XOR'd), followed by a \x00, followed by the first 11 characters of the term. If there are not enough characters to make 20 bytes, the string is padded with \x00s to 20 bytes.
crawlHashWord(string $string[, bool $raw = false ]) : string
- $string : string
word to hash
- $raw : bool = false
whether to base64Hash the result
Return values
string —first 8 bytes of md5 of $string, concatenated with \x00 to indicate the hash is of a word not a phrase, concatenated with the $meta_string padded to 11 bytes.
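A sketch of the 20 byte layout from the summary above: 8 hash bytes, a \x00 separator, then up to 11 characters of the term, padded with \x00. sketchWordId is a hypothetical name; the $raw option is left out, so this always returns the raw 20 bytes.

```php
<?php
// Hypothetical sketch of the 20 byte word id layout:
// [8 byte hash][\x00][first 11 chars of term, \x00 padded]
function sketchWordId(string $term): string
{
    $md5 = md5($term, true);
    $hash = substr($md5, 0, 8) ^ substr($md5, 8, 8); // xor of md5 halves
    $id = $hash . "\x00" . substr($term, 0, 11);
    return str_pad($id, 20, "\x00");
}

$id = sketchWordId("cat");
echo strlen($id), " ", bin2hex(substr($id, 8)), "\n";
```

Keeping a readable prefix of the term after the hash lets ids that share a hash still be told apart by their suffix.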
Takes a $term that might have come from a document and converts it to a string of 16 bytes, which is either the original term padded by underscores or the first seven chars of the term followed by an underscore followed by the base64 encoding of the first 6 chars of its md5 hash.
canonicalTerm(string $term) : string
Base64 used to make this all nice and printable.
- $term : string
to be made into a canonical form
Return values
string —version of the term canonicalized as described above.
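The two cases can be sketched like this. sketchCanonicalTerm is a hypothetical name; choosing the padded form whenever the term is at most 16 bytes, and using the standard base64 alphabet, are assumptions.

```php
<?php
// Hypothetical sketch of a 16 byte canonical term.
function sketchCanonicalTerm(string $term): string
{
    if (strlen($term) <= 16) {
        // assumption: short terms are kept and padded with underscores
        return str_pad($term, 16, "_");
    }
    // 6 raw md5 bytes base64 encode to exactly 8 chars (no "=" padding),
    // so 7 + 1 + 8 = 16 bytes in total
    $suffix = base64_encode(substr(md5($term, true), 0, 6));
    return substr($term, 0, 7) . "_" . $suffix;
}

echo sketchCanonicalTerm("cat"), "\n";
echo sketchCanonicalTerm("internationalization"), "\n";
```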
Used to compare two ids for index dictionary lookup. Ids are an 8 byte crawlHash together with a 12 byte non-hash suffix.
compareWordHashes(string $id1, string $id2) : int
- $id1 : string
20 byte word id to compare
- $id2 : string
20 byte word id to compare
Return values
int —negative if $id1 smaller, positive if bigger, and 0 if same
Converts a crawl hash number to something close to base64 encoding, but chosen so that it doesn't get confused in URLs or DBs
base64Hash(string $string) : string
- $string : string
a hash to base64 encode
Return values
string —the encoded hash
Decodes a crawl hash number from base64 to raw ASCII
unbase64Hash(string $base64) : string
- $base64 : string
a hash to decode
Return values
string —the decoded hash
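A common way to get a base64-like encoding that is safe in URLs is to swap "+" and "/" for "-" and "_" and drop the "=" padding. The sketch below assumes that scheme; sketchBase64Hash and sketchUnbase64Hash are hypothetical names and the exact substitutions Yioop makes are not stated here.

```php
<?php
// Hypothetical sketch: URL-safe base64 without padding.
function sketchBase64Hash(string $string): string
{
    return rtrim(strtr(base64_encode($string), "+/", "-_"), "=");
}

// Hypothetical inverse: restore the alphabet and the padding, then decode.
function sketchUnbase64Hash(string $base64): string
{
    $padded_length = (int) (ceil(strlen($base64) / 4) * 4);
    $padded = str_pad($base64, $padded_length, "=");
    return (string) base64_decode(strtr($padded, "-_", "+/"));
}

$raw = "\xfb\xff\xbf\x00\x01\x02\x03\x04"; // 8 raw bytes
$encoded = sketchBase64Hash($raw);
echo $encoded, " ", strlen($encoded), "\n";
```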
Encodes a string in a format suitable for post data (mainly base64, but with characters that might mess up a post replaced in the result)
webencode(string $str) : string
- $str : string
string to encode
Return values
string —encoded string
Decodes a string encoded by webencode
webdecode(string $str) : string
- $str : string
string to decode
Return values
string —decoded string
The crawlHash function is used to encrypt passwords stored in the database.
crawlCrypt(string $string[, int $salt = null ]) : string
It tries to use the best version of the Blowfish variant of PHP's crypt function available on the current system.
- $string : string
the string to encrypt
- $salt : int = null
salt value to be used (needed to verify if a password is valid)
Return values
string —the crypted string where crypting is done using crawlHash
Used by a controller to take a table and return those rows in the table that a given queue_server would be responsible for handling
partitionByHash(array<string|int, mixed> $table, string $field, int $num_partition, int $instance[, object $callback = null ]) : array<string|int, mixed>
- $table : array<string|int, mixed>
an array of rows of associative arrays which a queue_server might need to process
- $field : string
column of $table whose values should be used for partitioning
- $num_partition : int
number of queue_servers to choose between
- $instance : int
the id of the particular server we are interested in
- $callback : object = null
function or static method that might be applied to input before deciding the responsible queue_server. For example, if input was a url we might want to get the host before deciding on the queue_server
Return values
array<string|int, mixed> —the reduced table that the $instance queue_server is responsible for
Used by a controller to say which queue_server should receive a given input
calculatePartition(string $input, int $num_partition[, object $callback = null ]) : int
- $input : string
can be viewed as a key that might be processed by a queue_server. For example, in some cases input might be a url and we want to determine which queue_server should be responsible for queuing that url
- $num_partition : int
number of queue_servers to choose between
- $callback : object = null
function or static method that might be applied to input before deciding the responsible queue_server. For example, if the input was a url we might want to get the host before deciding on the queue_server
Return values
int —id of server responsible for input
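The partitioning idea can be sketched as hashing the (optionally transformed) input and reducing it modulo the number of queue_servers. sketchCalculatePartition is a hypothetical name, and crc32 stands in for whatever hash Yioop actually uses.

```php
<?php
// Hypothetical sketch: map an input to a server id in 0..$num_partition-1.
function sketchCalculatePartition(string $input, int $num_partition,
    ?callable $callback = null): int
{
    $key = ($callback === null) ? $input : (string) $callback($input);
    // crc32 is an assumed stand-in hash; the mask keeps the value
    // non-negative on 32 bit builds
    return (crc32($key) & 0x7fffffff) % $num_partition;
}

// Partition by host so every url of a site lands on the same server.
$host_of = function ($url) {
    return (string) parse_url($url, PHP_URL_HOST);
};
$urls = ["https://www.example.com/a", "https://www.example.com/b",
    "https://www.example.org/"];
foreach ($urls as $url) {
    echo $url, " => ", sketchCalculatePartition($url, 4, $host_of), "\n";
}
```

Filtering a table down to the rows one server owns, as partitionByHash describes, then amounts to keeping the rows whose computed partition equals $instance.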
Measures the change in time in seconds between two timestamps to microsecond precision
changeInMicrotime(string $start[, string $end = null ]) : float
- $start : string
starting time with microseconds
- $end : string = null
ending time with microseconds, if null use current time
Return values
float —time difference in seconds
Timestamp of current epoch with microsecond precision, useful for situations where time() might cause too many collisions (account creation, etc.)
microTimestamp() : string
Return values
string —timestamp, to the microsecond, of the time in seconds since the start of the current epoch
Checks that a timestamp is within the time interval given by a start time (HH:mm) and a duration
checkTimeInterval(string $start_time, string $duration[, int $time = -1 ]) : int
- $start_time : string
string of the form (HH:mm)
- $duration : string
string containing an int in seconds
- $time : int = -1
a Unix timestamp.
Return values
int —-1 if the time of day of $time is not within the given interval. Otherwise, the Unix timestamp at which the interval will be over for the same day as $time.
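A sketch of the interval check, under stated assumptions: sketchCheckTimeInterval is a hypothetical name, the interval is treated as starting on the same calendar day as $time, its end is exclusive, and intervals that wrap past midnight are not handled.

```php
<?php
// Hypothetical sketch: is $time inside [start, start + duration) today?
function sketchCheckTimeInterval(string $start_time, string $duration,
    int $time = -1): int
{
    if ($time < 0) {
        $time = time();
    }
    // start of the interval on the same calendar day as $time
    $start = strtotime(date("Y-m-d", $time) . " " . $start_time);
    $end = $start + intval($duration);
    return ($time >= $start && $time < $end) ? $end : -1;
}

date_default_timezone_set("UTC"); // fixed zone so the example is stable
$noon = gmmktime(12, 0, 0, 1, 15, 2024);
echo sketchCheckTimeInterval("11:30", "3600", $noon) - $noon, "\n";
echo sketchCheckTimeInterval("13:00", "3600", $noon), "\n";
```

With these inputs the first call reports an end time 1800 seconds after noon, and the second reports -1 because noon falls before the interval starts.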
Converts a CSS unit string into its equivalent in pixels. This is used by SvgProcessor.
convertPixels(string $value) : int
- $value : string
a number followed by a legal CSS unit
Return values
int —a number in pixels
Returns the number of files in a folder
countFiles(string $folder) : int
- $folder : string
path to folder to count
Return values
int —number of files
Creates folders along a filesystem path if they don't exist
makePath(string $path) : bool
- $path : string
a file system path
Return values
bool —success or failure
This is a callback function used in the process of recursively deleting a directory
deleteFileOrDir(string $file_or_dir) : mixed
- $file_or_dir : string
the filename or directory name to be deleted
Return values
mixed
This is a callback function used in the process of recursively chmoding to 777 all files in a folder
setWorldPermissions(string $file) : mixed
- $file : string
the filename or directory name to be chmod
Return values
mixed
This is a callback function used in the process of recursively calculating an array of file modification times and file sizes for a directory
fileInfo(string $file) : array
- $file : string
a name of a file in the file system
Return values
array —an array whose single element contains an associative array with the size and modification time of the file
Callback function used to sort documents by a field
orderCallback(string $word_doc_a, string $word_doc_b[, string $order_field = null ]) : int
Should be initialized before using in usort with a call like: orderCallback($tmp, $tmp, "field_want");
- $word_doc_a : string
doc id of first document to compare
- $word_doc_b : string
doc id of second document to compare
- $order_field : string = null
which field of these associative arrays to sort by
Return values
int —-1 if first doc bigger 1 otherwise
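The initialize-then-sort pattern mentioned above relies on a static variable that remembers the field between calls. A minimal sketch, with the hypothetical name sketchOrderCallback and documents assumed to be associative arrays:

```php
<?php
// Hypothetical sketch: comparison callback whose sort field is set once
// by an initialization call and then remembered in a static variable.
function sketchOrderCallback($word_doc_a, $word_doc_b, $order_field = null)
{
    static $field = "a";
    if ($order_field !== null) {
        $field = $order_field;
    }
    // ?? 0 keeps the initialization call with empty arrays warning-free
    return (($word_doc_a[$field] ?? 0) > ($word_doc_b[$field] ?? 0)) ? -1 : 1;
}

$docs = [["SCORE" => 2], ["SCORE" => 9], ["SCORE" => 5]];
$tmp = [];
sketchOrderCallback($tmp, $tmp, "SCORE"); // initialization call
usort($docs, "sketchOrderCallback");      // bigger SCORE sorts first
echo implode(",", array_column($docs, "SCORE")), "\n";
```

usort only passes two arguments to its callback, which is why the field has to be supplied through a separate call beforehand.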
Callback function used to sort documents by a field where the field is assumed to be a string
stringOrderCallback(string $word_doc_a, string $word_doc_b[, string $order_field = null ]) : int
Should be initialized before using in usort with a call like: stringOrderCallback($tmp, $tmp, "field_want");
- $word_doc_a : string
doc id of first document to compare
- $word_doc_b : string
doc id of second document to compare
- $order_field : string = null
which field of these associative arrays to sort by
Return values
int —-1 if first doc smaller 1 otherwise
Callback function used to sort documents by a field where the field is assumed to be a string
stringROrderCallback(string $word_doc_a, string $word_doc_b[, string $order_field = null ]) : int
Should be initialized before using in usort with a call like: stringROrderCallback($tmp, $tmp, "field_want");
- $word_doc_a : string
doc id of first document to compare
- $word_doc_b : string
doc id of second document to compare
- $order_field : string = null
which field of these associative arrays to sort by
Return values
int —-1 if first doc bigger 1 otherwise
Callback function used to sort documents by a field in reverse order
rorderCallback(string $word_doc_a, string $word_doc_b[, string $order_field = null ]) : int
Should be initialized before using in usort with a call like: rorderCallback($tmp, $tmp, "field_want");
- $word_doc_a : string
doc id of first document to compare
- $word_doc_b : string
doc id of second document to compare
- $order_field : string = null
which field of these associative arrays to sort by
Return values
int —1 if first doc bigger -1 otherwise
Callback to check if $a is less than $b
lessThan(float $a, float $b) : int
Used to help sort document results returned in PhraseModel called in IndexArchiveBundle
- $a : float
first value to compare
- $b : float
second value to compare
Return values
int —-1 if $a is less than $b; 1 otherwise
Callback to check if $a is greater than $b
greaterThan(float $a, float $b) : int
Used to help sort document results returned in PhraseModel called in IndexArchiveBundle
- $a : float
first value to compare
- $b : float
second value to compare
Return values
int —-1 if $a is greater than $b; 1 otherwise
Shorthand for echo
e(string $text) : mixed
- $text : string
string to send to the current output
Return values
mixed
Compute the real remote address of the incoming connection including forwarding
remoteAddress() : mixed
Return values
mixed
Used to read a line of input from the command-line
readInput() : string
Return values
string —from the command-line
Used to read a line of input from the command-line (on unix machines without echoing it)
readPassword() : string
Return values
string —from the command-line
Used to read several lines from the terminal up until a last line consisting of just a "."
readMessage() : string
Return values
string —from the command-line
Returns the mime type of the provided file name if it can be determined.
mimeType(string $file_name[, bool $use_extension = false ]) : string
- $file_name : string
(name of file including path to figure out mime type for)
- $use_extension : bool = false
whether to just try to guess from the file extension rather than looking at the file
Return values
string —mime type or unknown if can't be determined
Checks if class_1 is the same as class_2 or has class_2 as a parent. Behaves like the 3 param version (last param true) of PHP's is_a function that came into being with Version 5.3.9.
generalIsA(mixed $class_1, mixed $class_2) : bool
- $class_1 : mixed
object or string class name to see if in class2
- $class_2 : mixed
object or string class name to see if contains class1
Return values
bool —equal or contains class
Given the contents of a start XML/HTML tag, strips out all the attributes not listed in $safe_attribute_list
stripAttributes(string $start_tag_contents[, array<string|int, mixed> $safe_attribute_list = [] ]) : string
- $start_tag_contents : string
the contents of an HTML/XML tag. I.e., if the tag was <tag stuff> then $start_tag_contents could be stuff
- $safe_attribute_list : array<string|int, mixed> = []
a list of attributes which should be kept
Return values
string —containing only safe attributes and their values
Used to parse into a two dimensional array a string that contains CSV data.
parseCsv(string $csv_string) : array<string|int, mixed>
- $csv_string : string
string with csv data
Return values
array<string|int, mixed> —two dimensional array of elements from csv
Converts an array of values to a comma separated value formatted string.
arraytoCsv(array<string|int, mixed> $arr) : string
- $arr : array<string|int, mixed>
values to convert
Return values
string —CSV string after conversion
Computes a Unix-style diff of two strings. That is it only outputs lines which disagree between the two strings. It outputs +line if a line occurs in the second but not first string and -line if a line occurs in the first string but not the second.
diff(string $data1, string $data2[, bool $html = false ]) : string
- $data1 : string
first string to compare
- $data2 : string
second string to compare
- $html : bool = false
whether to output html highlighting
Return values
string —representing info about where $data1 and $data2 don't match
Computes the longest common subsequence of two arrays
computeLCS(array<string|int, mixed> $lines1, array<string|int, mixed> $lines2, int $offset) : mixed
- $lines1 : array<string|int, mixed>
an array of lines to compute LCS of
- $lines2 : array<string|int, mixed>
an array of lines to compute LCS of
- $offset : int
an offset to shift over array addresses in output by
Return values
mixed
Extracts from a table of longest common sequence moves (probably calculated by computeLCS) and a starting coordinate $i, $j in that table, a longest common subsequence
extractLCSFromTable(array<string|int, mixed> $lcs_moves, array<string|int, mixed> $lines, int $i, int $j, int $offset, array<string|int, mixed> &$lcs) : mixed
- $lcs_moves : array<string|int, mixed>
a table of moves computed by computeLCS
- $lines : array<string|int, mixed>
from first of the two arrays computing LCS of
- $i : int
a line number in string 1
- $j : int
a line number in string 2
- $offset : int
a number to add to each line number output into $lcs. This is useful if we have trimmed off the initially common lines from our two strings we are trying to compute the LCS of
- $lcs : array<string|int, mixed>
an array of triples (index_string1, index_string2, line). The indexes indicate the line number in each string; line is the line common to the two strings
Return values
mixed
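For orientation, the usual dynamic programming approach to a longest common subsequence builds a table of lengths and then walks it backwards to collect matches. The sketch below uses that textbook method and returns triples in the (index_string1, index_string2, line) shape described above; sketchLcs is a hypothetical name, it folds both steps into one function, and it does not model the $offset parameter.

```php
<?php
// Hypothetical sketch: textbook LCS over two lists of lines.
function sketchLcs(array $lines1, array $lines2): array
{
    $lines1 = array_values($lines1);
    $lines2 = array_values($lines2);
    $m = count($lines1);
    $n = count($lines2);
    // $table[$i][$j] = LCS length of the first $i and first $j lines
    $table = array_fill(0, $m + 1, array_fill(0, $n + 1, 0));
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $table[$i][$j] = ($lines1[$i - 1] === $lines2[$j - 1]) ?
                $table[$i - 1][$j - 1] + 1 :
                max($table[$i - 1][$j], $table[$i][$j - 1]);
        }
    }
    // walk back from the bottom right corner collecting common lines
    $lcs = [];
    $i = $m;
    $j = $n;
    while ($i > 0 && $j > 0) {
        if ($lines1[$i - 1] === $lines2[$j - 1]) {
            array_unshift($lcs, [$i - 1, $j - 1, $lines1[$i - 1]]);
            $i--;
            $j--;
        } elseif ($table[$i - 1][$j] >= $table[$i][$j - 1]) {
            $i--;
        } else {
            $j--;
        }
    }
    return $lcs;
}

$common = sketchLcs(["a", "b", "c", "d"], ["a", "c", "d", "e"]);
echo implode(" ", array_column($common, 2)), "\n";
```

Lines that appear in neither triple list are exactly the ones a diff would report with - (first input only) or + (second input only).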
Returns an array of the last $num_lines many lines out of a file
tail(string $file_name, string $num_lines) : array<string|int, mixed>
- $file_name : string
name of file to return lines from
- $num_lines : string
number of lines to retrieve
Return values
array<string|int, mixed> —retrieved lines
Given an array of lines returns a subarray of those lines containing the filter string or filter array
lineFilter(string $lines, mixed $filters[, bool $case_insensitive = true ]) : array<string|int, mixed>
- $lines : string
to search
- $filters : mixed
either string to filter lines with or an array of strings (any of which can be present to pass the filter)
- $case_insensitive : bool = true
whether search should be done case insensitively or not.
Return values
array<string|int, mixed> —lines containing the string
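The filtering described above can be sketched with strpos and stripos. sketchLineFilter is a hypothetical name, and $lines is taken to be an array of strings as the summary says.

```php
<?php
// Hypothetical sketch: keep the lines containing any of the filters.
function sketchLineFilter(array $lines, $filters,
    bool $case_insensitive = true): array
{
    $filters = is_array($filters) ? $filters : [$filters];
    $find = $case_insensitive ? "stripos" : "strpos";
    $out = [];
    foreach ($lines as $line) {
        foreach ($filters as $filter) {
            if ($find($line, $filter) !== false) {
                $out[] = $line;
                break; // one matching filter is enough
            }
        }
    }
    return $out;
}

$log = ["Fetcher started", "ERROR disk full", "warning: slow host"];
print_r(sketchLineFilter($log, ["error", "warning"]));
```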
Tries to extract a timestamp from a line which is presumed to come from a Yioop log file
logLineTimestamp(string $line) : int
- $line : string
to search
Return values
int —timestamp of that log entry
Returns whether an input can be parsed to a positive integer
isPositiveInteger(mixed $input) : bool
- $input : mixed
Return values
bool —whether $input can be parsed to a positive integer.
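One way such a check could look, as a sketch: accept ints greater than zero and strings made only of ASCII digits with a value greater than zero. sketchIsPositiveInteger is a hypothetical name, and rejecting signs, whitespace, and decimals is an assumption about what "can be parsed" means.

```php
<?php
// Hypothetical sketch: true for 7, "7", "007"; false for 0, "-3", "1.5".
function sketchIsPositiveInteger($input): bool
{
    if (is_int($input)) {
        return $input > 0;
    }
    // \z anchors at the very end, so a trailing newline is rejected
    return is_string($input) && preg_match('/^\d+\z/', $input) === 1
        && intval($input) > 0;
}

foreach ([7, "7", "0", "-3", "1.5", "abc"] as $candidate) {
    echo var_export($candidate, true), " => ",
        var_export(sketchIsPositiveInteger($candidate), true), "\n";
}
```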
Used to measure the memory footprint in bytes and time spent calling a method of an object. It also records the number of times the method has been called.
measureCall(object $object, string $method[, mixed $arguments = [] ][, string $call_name = "" ]) : mixed
Just calls the method without any recording or timing until an initial call to the function measureCall(null, save_statistics_file) where save_statistics_file is the name of the file you want to store statistics to.
- $object : object
name of object whose method we want to call and measure
- $method : string
method we're calling
- $arguments : mixed = []
- $call_name : string = ""
name to use when outputting stats for this call, defaults to $method.
Return values
mixed —whatever the method would normally return when called as above
Used to measure the memory footprint of an object in Yioop and save it to a statistics file. No recording is done until an initial call to the function measureCall(null, save_statistics_file) where save_statistics_file is the name of the file you want to store statistics to.
measureObject(object $object[, string $save_file = "" ][, mixed $class_name = "" ]) : mixed
- $object : object
name of object whose size we want to measure
- $save_file : string = ""
statistics file to write info to
- $class_name : mixed = ""
Return values
mixed
General method called by measureCall and measureObject. Used to measure the memory footprint in bytes of an object, or the memory and time spent calling a method of an object. It also records the number of times the method has been called. When used to call a method before initialization, it just calls the method without any recording or timing. To initialize, make an initial call to the function measureCall(null, save_statistics_file) where save_statistics_file is the name of the file you want to store statistics to.
measureObjectCall(object $object, string $method[, mixed $arguments = [] ][, string $call_name = "" ]) : mixed
- $object : object
name of object whose method we want to call and measure
- $method : string
method we're calling
- $arguments : mixed = []
- $call_name : string = ""
name to use when outputting stats for this call, defaults to $method.
Return values
mixed —whatever the method would normally return when called as above
Makes a deep copy of a variable regardless of its type
variableClone(mixed $var) : mixed
- $var : mixed
variable to deep copy
Return values
mixed —the deep copy
Runs various system garbage collection functions and returns number of bytes freed.
garbageCollect() : int
Return values
int —number of bytes freed
The DOM method saveHTML has a tendency to replace UTF-8, non-ASCII characters with HTML entities. This function is supposed to save while avoiding that replacement.
utf8SafeSaveHtml(DOMDocument $dom) : string
What it does is first save the dom, then replace HTML entities of the form &single_char; or &#some_number; with the UTF-8 they correspond to. It leaves all other entities as they are.
- $dom : DOMDocument
Return values
string —output of saving html
A UTF-8 safe version of PHP's wordwrap function that wraps a string to a given number of characters
utf8WordWrap(string $string[, int $width = 75 ][, string $break = "\n" ][, bool $cut = false ]) : string
- $string : string
the input string
- $width : int = 75
the number of characters at which the string will be wrapped
- $break : string = "\n"
string used to break a line into two
- $cut : bool = false
whether to always force wrap at $width characters even if word hasn't ended
Return values
string —the given string wrapped at the specified length