Yioop_V9.5_Source_Code_Documentation

FetchGitRepositoryUrls
in package
implements CrawlConstants

Library of functions used to fetch Git internal urls

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

BLOB_ACCESS_CODE_END  = 6
Git blob access code ending position
BLOB_ACCESS_CODE_START  = 0
Git blob access code starting position
CURL_TIMEOUT  = 5
A cURL time out parameter
CURL_TRANSFER  = 1
A cURL transfer parameter
GIT_BASE_END_LETTER  = 1
A fixed indicator used to get last letter of git base url
GIT_BASE_URL_END  = '###'
An indicator to tell ending position of Git url to be used
GIT_BASE_URL_END_POSITION  = -1
A fixed indicator used to get last letter of git base url
GIT_BASE_URL_START  = 0
An indicator to tell starting position of Git url to be used
GIT_BLOB_INDICATOR  = '100'
An indicator to represent that a git file is a blob file
GIT_BLOB_NEXT  = 7
An indicator to represent the next position after the access code in a Git blob object
GIT_BLOB_OBJECT  = "blob"
A fixed indicator used to indicate Git blob object
GIT_FILE_NAME_END  = 38
A fixed indicator used to mark ending position of SHA hash used to indicate Git object file
GIT_FILE_NAME_START  = 2
A fixed indicator used to mark starting position of SHA hash used to indicate Git object file
GIT_FOLDER_NAME_END  = 2
A fixed indicator used to mark ending position of SHA hash used to indicate Git object folder
GIT_FOLDER_NAME_START  = 0
A fixed indicator used to mark starting position of SHA hash used to indicate Git object folder
GIT_MASTER_TREE_HASH_END  = 41
A fixed indicator used to mark ending position of SHA hash of Git master tree
GIT_MASTER_TREE_HASH_START  = 16
A fixed indicator used to mark starting position of SHA hash of Git master tree
GIT_NAME_START  = 0
An indicator for the starting position of a Git file or folder name
GIT_NEXT_URL_END  = 40
A fixed position used to indicate ending position to fetch next Git url from the master file
GIT_NEXT_URL_START  = 0
A fixed position used to indicate starting point to fetch next Git url from the master file
GIT_TREE_INDICATOR  = '400'
An indicator to represent that a git file is a tree file
GIT_TREE_NEXT  = 6
An indicator to represent the next position after the access code in a Git tree object
GIT_TREE_OBJECT  = "tree"
A fixed indicator used to indicate Git tree object
GIT_URL_CONTINUE  = '@@@@'
An indicator to tell more git urls need to be fetched
GIT_URL_EXTENSION  = 'info/refs?service=git-upload-pack'
A fixed component to be used with Git base url to form Git first url
GIT_URL_OBJECT  = 'objects/'
A fixed component to be used with Git urls to get next Git urls
GIT_URL_SPLIT  = '/'
A fixed indicator used to make desired Git folder structure from SHA hash
HEX_NULL_CHARACTER  = "\x00"
The null character (hex 00), which Git objects use as a delimiter, for example between an entry name and its SHA hash in a tree object
INDICATOR_GIT  = 'git'
An indicator to indicate git repository
INDICATOR_NONE  = 'none'
An indicator to tell no actions to be taken
SHA_HASH_BINARY_END  = 20
Git SHA hash binary ending position
SHA_HASH_BINARY_START  = 0
Git SHA hash binary starting position
TREE_ACCESS_CODE_END  = 5
Git tree access code ending position
TREE_ACCESS_CODE_START  = 0
Git tree access code starting position
$all_git_urls  : array<string|int, mixed>
An array used to store all the Git internal urls
$repository_types  : array<string|int, mixed>
A list of repository type names, keyed by the indicator used to recognize each type
checkForRepository()  : string
Checks repository type based on extension
checkNestedStructure()  : string
Checks the nested structure inside git tree object
checkPosition()  : array<string|int, mixed>
Checks the position of access code for null values
fetchGitRepositoryUrl()  : array<string|int, mixed>
Get the Git internal urls from the parent Git url
getGitData()  : string
Makes the cURL call to get the contents
getGitMasterFile()  : string
Get the Git second url which points to Git master tree structure
getGitMasterTree()  : string
Get the Git third url which contains the information about the organization of entire git repository
getNextGitUrl()  : string
Get the Git content from url which will be used to get the next git url
getObjects()  : array<string|int, mixed>
Get the Git blob and tree objects
readBlobSha()  : array<string|int, mixed>
Get the details of the blob file, i.e., the blob file name, sha hash and content
readTreeSha()  : array<string|int, mixed>
Get the details of the tree file, i.e., the folder name, sha hash and blob url inside the tree
setGitRepositoryUrl()  : array<string|int, mixed>
Sets up the seed sites with urls from a git repository (updates these sites if downloading from the repository has already started)
urlMaker()  : string
Makes the git clone internal url for blob objects

Constants

BLOB_ACCESS_CODE_END

Git blob access code ending position

public mixed BLOB_ACCESS_CODE_END = 6

BLOB_ACCESS_CODE_START

Git blob access code starting position

public mixed BLOB_ACCESS_CODE_START = 0

GIT_BASE_END_LETTER

A fixed indicator used to get last letter of git base url

public mixed GIT_BASE_END_LETTER = 1

GIT_BASE_URL_END

An indicator to tell ending position of Git url to be used

public mixed GIT_BASE_URL_END = '###'

GIT_BASE_URL_END_POSITION

A fixed indicator used to get last letter of git base url

public mixed GIT_BASE_URL_END_POSITION = -1

GIT_BASE_URL_START

An indicator to tell starting position of Git url to be used

public mixed GIT_BASE_URL_START = 0

GIT_BLOB_INDICATOR

An indicator to represent that a git file is a blob file

public mixed GIT_BLOB_INDICATOR = '100'

GIT_BLOB_NEXT

An indicator to represent the next position after the access code in a Git blob object

public mixed GIT_BLOB_NEXT = 7

GIT_BLOB_OBJECT

A fixed indicator used to indicate Git blob object

public mixed GIT_BLOB_OBJECT = "blob"

GIT_FILE_NAME_END

A fixed indicator used to mark ending position of SHA hash used to indicate Git object file

public mixed GIT_FILE_NAME_END = 38

GIT_FILE_NAME_START

A fixed indicator used to mark starting position of SHA hash used to indicate Git object file

public mixed GIT_FILE_NAME_START = 2

GIT_FOLDER_NAME_END

A fixed indicator used to mark ending position of SHA hash used to indicate Git object folder

public mixed GIT_FOLDER_NAME_END = 2

GIT_FOLDER_NAME_START

A fixed indicator used to mark starting position of SHA hash used to indicate Git object folder

public mixed GIT_FOLDER_NAME_START = 0

GIT_MASTER_TREE_HASH_END

A fixed indicator used to mark ending position of SHA hash of Git master tree

public mixed GIT_MASTER_TREE_HASH_END = 41

GIT_MASTER_TREE_HASH_START

A fixed indicator used to mark starting position of SHA hash of Git master tree

public mixed GIT_MASTER_TREE_HASH_START = 16

GIT_NAME_START

An indicator for the starting position of a Git file or folder name

public mixed GIT_NAME_START = 0

GIT_NEXT_URL_END

A fixed position used to indicate ending position to fetch next Git url from the master file

public mixed GIT_NEXT_URL_END = 40

GIT_NEXT_URL_START

A fixed position used to indicate starting point to fetch next Git url from the master file

public mixed GIT_NEXT_URL_START = 0

GIT_TREE_INDICATOR

An indicator to represent that a git file is a tree file

public mixed GIT_TREE_INDICATOR = '400'

GIT_TREE_NEXT

An indicator to represent the next position after the access code in a Git tree object

public mixed GIT_TREE_NEXT = 6

GIT_TREE_OBJECT

A fixed indicator used to indicate Git tree object

public mixed GIT_TREE_OBJECT = "tree"

GIT_URL_CONTINUE

An indicator to tell more git urls need to be fetched

public mixed GIT_URL_CONTINUE = '@@@@'

GIT_URL_EXTENSION

A fixed component to be used with Git base url to form Git first url

public mixed GIT_URL_EXTENSION = 'info/refs?service=git-upload-pack'

GIT_URL_OBJECT

A fixed component to be used with Git urls to get next Git urls

public mixed GIT_URL_OBJECT = 'objects/'

GIT_URL_SPLIT

A fixed indicator used to make desired Git folder structure from SHA hash

public mixed GIT_URL_SPLIT = '/'

HEX_NULL_CHARACTER

The null character (hex 00), which Git objects use as a delimiter, for example between an entry name and its SHA hash in a tree object

public mixed HEX_NULL_CHARACTER = "\x00"

SHA_HASH_BINARY_START

Git SHA hash binary starting position

public mixed SHA_HASH_BINARY_START = 0

TREE_ACCESS_CODE_END

Git tree access code ending position

public mixed TREE_ACCESS_CODE_END = 5

TREE_ACCESS_CODE_START

Git tree access code starting position

public mixed TREE_ACCESS_CODE_START = 0

Properties

$all_git_urls

An array used to store all the Git internal urls

public static array<string|int, mixed> $all_git_urls

$repository_types

A list of repository type names, keyed by the indicator used to recognize each type

public static array<string|int, mixed> $repository_types = ['git' => 'git', 'svn' => 'svn', 'cvs' => 'cvs', 'vss' => 'vss', 'mercurial' => 'mercurial', 'monotone' => 'monotone', 'bazaar' => 'bazaar', 'darcs' => 'darcs', 'arch' => 'arch']

Methods

checkForRepository()

Checks repository type based on extension

public static checkForRepository(string $extension) : string
Parameters
$extension : string

extension to check

Return values
string

$repository_type repository type based on the extension of urls
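
The lookup described above can be pictured as a simple map access. The following is a minimal sketch, not the actual Yioop implementation: the function name sketchCheckForRepository and the constant SKETCH_REPOSITORY_TYPES are hypothetical, and falling back to 'none' (the value of INDICATOR_NONE) for unknown extensions is an assumption.

```php
<?php
// Hypothetical stand-in for FetchGitRepositoryUrls::$repository_types.
const SKETCH_REPOSITORY_TYPES = [
    'git' => 'git', 'svn' => 'svn', 'cvs' => 'cvs', 'vss' => 'vss',
    'mercurial' => 'mercurial', 'monotone' => 'monotone',
    'bazaar' => 'bazaar', 'darcs' => 'darcs', 'arch' => 'arch',
];

/**
 * Hypothetical helper: maps a url extension to a repository type.
 * Assumption: unknown extensions yield 'none' (value of INDICATOR_NONE).
 */
function sketchCheckForRepository(string $extension): string
{
    // Normalize ".GIT" or "git" to the plain lower-case key.
    $key = strtolower(ltrim($extension, '.'));
    return SKETCH_REPOSITORY_TYPES[$key] ?? 'none';
}
```

Under these assumptions, sketchCheckForRepository('git') yields 'git', while an unrelated extension such as 'html' yields 'none'.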

checkNestedStructure()

Checks the nested structure inside git tree object

public static checkNestedStructure(string $sha_hash, string $git_base_url) : string
Parameters
$sha_hash : string

sha of the git tree object

$git_base_url : string

common portion of the parent git url

Return values
string

$blob_url contains url of the blob file inside the folder

checkPosition()

Checks the position of access code for null values

public static checkPosition(string $git_blob_position, string $git_tree_position, string $git_object_content) : array<string|int, mixed>
Parameters
$git_blob_position : string

first occurrence of git blob access code

$git_tree_position : string

first occurrence of git tree access code

$git_object_content : string

compressed content of git master tree

Return values
array<string|int, mixed>

$git_object_positions length of the compressed content after the access code

fetchGitRepositoryUrl()

Get the Git internal urls from the parent Git url

public static fetchGitRepositoryUrl(string $url_to_check) : array<string|int, mixed>
Parameters
$url_to_check : string

url that needs to be processed

Return values
array<string|int, mixed>

$git_next_urls list of Git internal urls that are requested during a git clone

getGitData()

Makes the cURL call to get the contents

public static getGitData(string $git_url) : string
Parameters
$git_url : string

url to download the contents from

Return values
string

$git_content actual content of the git url
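
A cURL download of this kind might look roughly like the following sketch. The function names are hypothetical, and pairing CURL_TRANSFER = 1 with CURLOPT_RETURNTRANSFER and CURL_TIMEOUT = 5 with CURLOPT_TIMEOUT is an assumption based only on the constant names and values listed on this page.

```php
<?php
/**
 * Hypothetical helper: cURL options for one download.
 * Assumption: CURL_TRANSFER maps to CURLOPT_RETURNTRANSFER and
 * CURL_TIMEOUT maps to CURLOPT_TIMEOUT (in seconds).
 */
function sketchCurlOptions(string $git_url): array
{
    return [
        CURLOPT_URL => $git_url,
        CURLOPT_RETURNTRANSFER => 1, // CURL_TRANSFER: return body as a string
        CURLOPT_TIMEOUT => 5,        // CURL_TIMEOUT: give up after 5 seconds
    ];
}

/**
 * Hypothetical helper: downloads $git_url and returns its content,
 * or an empty string if the transfer fails.
 */
function sketchGetGitData(string $git_url): string
{
    $handle = curl_init();
    curl_setopt_array($handle, sketchCurlOptions($git_url));
    $git_content = curl_exec($handle);
    // The handle is released when $handle goes out of scope.
    return ($git_content === false) ? '' : $git_content;
}
```

Returning an empty string on failure is a design choice of this sketch; the documented method only promises a string.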

getGitMasterFile()

Get the Git second url which points to Git master tree structure

public static getGitMasterFile(string $git_first_url_content, string $git_base_url) : string
Parameters
$git_first_url_content : string

contents of Git first url

$git_base_url : string

common portion of Git urls

Return values
string

$git_next_url consists of second internal Git url
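
To illustrate how the first url and its content could be handled, here is a minimal sketch with hypothetical function names. It assumes that $git_base_url ends with '/', and that the server answers with the plain refs listing in which each line starts with a 40-character hex hash followed by a tab and a ref name. The offsets 0 and 40 mirror GIT_NEXT_URL_START and GIT_NEXT_URL_END; how the real method uses them is not documented here.

```php
<?php
/**
 * Hypothetical helper: forms the first Git url from the base url.
 * Assumption: $git_base_url already ends with '/'.
 */
function sketchFirstGitUrl(string $git_base_url): string
{
    return $git_base_url . 'info/refs?service=git-upload-pack'; // GIT_URL_EXTENSION
}

/**
 * Hypothetical helper: reads the leading 40-character hex hash from the
 * content of the first url. Returns '' if the content does not start
 * with such a hash.
 */
function sketchLeadingHash(string $git_first_url_content): string
{
    // GIT_NEXT_URL_START = 0, GIT_NEXT_URL_END = 40
    $hash = substr($git_first_url_content, 0, 40);
    return preg_match('/^[0-9a-f]{40}$/', $hash) === 1 ? $hash : '';
}
```

The hash obtained this way could then be turned into an object url with the same folder and file split that urlMaker() is documented to perform.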

getGitMasterTree()

Get the Git third url which contains the information about the organization of entire git repository

public static getGitMasterTree(string $git_second_url_content, string $git_base_url) : string
Parameters
$git_second_url_content : string

contents of Git second url

$git_base_url : string

common portion of git urls

Return values
string

$git_next_url consists of third internal git url
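
An uncompressed Git commit object has the form "commit <size>", a null byte, then "tree " and the 40-character hex hash of the root tree. With a three-digit size the hash begins at byte 16, which matches GIT_MASTER_TREE_HASH_START. The sketch below uses a hypothetical function name and, unlike the fixed offsets suggested by the constants, locates the null byte first so that it also works for other header lengths.

```php
<?php
/**
 * Hypothetical helper: pulls the tree hash out of an uncompressed commit
 * object of the form "commit <size>\0tree <40 hex chars>\n...".
 * Returns '' if the content does not have that shape.
 */
function sketchTreeHash(string $commit_content): string
{
    $header_end = strpos($commit_content, "\x00"); // HEX_NULL_CHARACTER
    if ($header_end === false) {
        return '';
    }
    $body = substr($commit_content, $header_end + 1);
    if (strncmp($body, 'tree ', 5) !== 0) {
        return '';
    }
    // The 40 hex characters after "tree " name the root tree object.
    return substr($body, 5, 40);
}
```

Searching for the null byte is a substitution made for this sketch; it is not a claim about how getGitMasterTree() itself finds the hash.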

getNextGitUrl()

Get the Git content from url which will be used to get the next git url

public static getNextGitUrl(string $git_url, string $compression_indicator) : string
Parameters
$git_url : string

git url to extract contents from it

$compression_indicator : string

indicates whether the contents are compressed or uncompressed

Return values
string

$git_object_content consists of the contents extracted from the url
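
Git stores loose objects zlib-compressed, so content fetched from an objects/ url generally has to be inflated before it can be read. The following sketch shows one way to do that with PHP's zlib functions. The function name is hypothetical, and the boolean $is_compressed merely stands in for $compression_indicator, whose actual values are not documented on this page.

```php
<?php
/**
 * Hypothetical helper: returns object content, inflating it first when
 * requested. Returns '' if the data is not valid zlib data.
 */
function sketchObjectContent(string $raw, bool $is_compressed): string
{
    if (!$is_compressed) {
        return $raw;
    }
    // gzuncompress() reads zlib-format data; it returns false (with a
    // warning, suppressed here) when the input is not valid.
    $inflated = @gzuncompress($raw);
    return ($inflated === false) ? '' : $inflated;
}
```

For example, passing gzcompress("blob 5\x00hello") with $is_compressed set to true gives back the original "blob 5\x00hello" string.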

getObjects()

Get the Git blob and tree objects

public static getObjects(string $git_object_content, string $git_base_url) : array<string|int, mixed>
Parameters
$git_object_content : string

compressed content of git master tree file

$git_base_url : string

common portion of git url

Return values
array<string|int, mixed>

$blob_url contains information and url for git blob objects
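
The body of an uncompressed Git tree object is a sequence of entries, each consisting of an access code (mode), a space, a name, a null byte and a 20-byte binary SHA hash. File modes such as 100644 are six characters long and directory mode 40000 is five, which lines up with BLOB_ACCESS_CODE_END = 6, GIT_BLOB_NEXT = 7, TREE_ACCESS_CODE_END = 5 and GIT_TREE_NEXT = 6. The sketch below is one possible parser under those assumptions; the function name and the shape of the returned array are hypothetical.

```php
<?php
/**
 * Hypothetical helper: splits the body of an uncompressed Git tree object
 * (the part after the "tree <size>\0" header) into entries.
 */
function sketchParseTree(string $tree_body): array
{
    $entries = [];
    $offset = 0;
    $length = strlen($tree_body);
    while ($offset < $length) {
        $space = strpos($tree_body, ' ', $offset);
        if ($space === false) {
            break;
        }
        $null = strpos($tree_body, "\x00", $space); // HEX_NULL_CHARACTER
        if ($null === false || $null + 21 > $length) {
            break; // truncated entry: no room for the 20-byte hash
        }
        $mode = substr($tree_body, $offset, $space - $offset);
        if (strncmp($mode, '100', 3) === 0) {        // GIT_BLOB_INDICATOR
            $type = 'blob';
        } elseif (strncmp($mode, '400', 3) === 0) {  // GIT_TREE_INDICATOR
            $type = 'tree';
        } else {
            $type = 'other';
        }
        $entries[] = [
            'type' => $type,
            'name' => substr($tree_body, $space + 1, $null - $space - 1),
            // SHA_HASH_BINARY_START = 0, SHA_HASH_BINARY_END = 20
            'sha' => bin2hex(substr($tree_body, $null + 1, 20)),
        ];
        $offset = $null + 21;
    }
    return $entries;
}
```

Each 'sha' value is the 40-character hex form of the binary hash, ready to be split into a folder and file name for the next object url.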

readBlobSha()

Get the details of the blob file, i.e., the blob file name, sha hash and content

public static readBlobSha(string $git_object_content, string $blob_position, string $length, string $git_base_url) : array<string|int, mixed>
Parameters
$git_object_content : string

compressed content of git master tree

$blob_position : string

first occurrence of git blob access code in $git_object_content

$length : string

length of the compressed content of git master tree

$git_base_url : string

common portion of git url

Return values
array<string|int, mixed>

$git_blob_content contains details of git blob object

readTreeSha()

Get the details of the tree file, i.e., the folder name, sha hash and blob url inside the tree

public static readTreeSha(string $git_object_content, string $tree_position, string $length, string $git_base_url) : array<string|int, mixed>
Parameters
$git_object_content : string

compressed content of git master tree

$tree_position : string

first occurrence of git tree access code in $git_object_content

$length : string

length of the compressed content of git master tree

$git_base_url : string

common portion of git url

Return values
array<string|int, mixed>

$git_tree_content contains details of git tree object

setGitRepositoryUrl()

Sets up the seed sites with urls from a git repository (updates these sites if downloading from the repository has already started)

public static setGitRepositoryUrl(string $url_to_check, int $counter, array<string|int, mixed> $seeds, array<string|int, mixed> $repository_indicator, array<string|int, mixed> $site_value, int $total_git_urls, array<string|int, mixed> $all_git_urls) : array<string|int, mixed>
Parameters
$url_to_check : string

url that needs to be processed

$counter : int

to keep track of number of urls processed

$seeds : array<string|int, mixed>

store sites which are ready to be downloaded

$repository_indicator : array<string|int, mixed>

indicates the type of the repository

$site_value : array<string|int, mixed>

contains original Git url crawled

$total_git_urls : int

number of urls in repository less those already processed

$all_git_urls : array<string|int, mixed>

current list of urls from git repository

Return values
array<string|int, mixed>

$git_internal_urls containing all the internal Git urls fetched from the parent Git url

urlMaker()

Makes the git clone internal url for blob objects

public static urlMaker(string $sha_hash, string $git_base_url) : string
Parameters
$sha_hash : string

sha hash of the git blob object

$git_base_url : string

common portion of git url

Return values
string

$git_object_url contains the complete url of the blob file
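
Git keeps each loose object under objects/ in a folder named after the first two hex characters of its SHA hash, with the remaining 38 characters as the file name. The constants GIT_FOLDER_NAME_START, GIT_FOLDER_NAME_END, GIT_FILE_NAME_START and GIT_FILE_NAME_END (0, 2, 2 and 38) fit that split. A minimal sketch, with a hypothetical function name and assuming $git_base_url ends with '/':

```php
<?php
/**
 * Hypothetical helper: builds the url of a loose Git object from its
 * 40-character hex SHA hash. Assumption: $git_base_url ends with '/'.
 */
function sketchUrlMaker(string $sha_hash, string $git_base_url): string
{
    $folder = substr($sha_hash, 0, 2); // GIT_FOLDER_NAME_START, GIT_FOLDER_NAME_END
    $file = substr($sha_hash, 2, 38);  // GIT_FILE_NAME_START, GIT_FILE_NAME_END
    // GIT_URL_OBJECT = 'objects/', GIT_URL_SPLIT = '/'
    return $git_base_url . 'objects/' . $folder . '/' . $file;
}
```

For a hash beginning with "ab12..." and the base url http://example.com/repo.git/ (an illustrative address), this produces a url of the form http://example.com/repo.git/objects/ab/12...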


        
