FetchGitRepositoryUrls
implements CrawlConstants
Library of functions used to fetch Git internal urls
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- BLOB_ACCESS_CODE_END = 6
- Git blob access code ending position
- BLOB_ACCESS_CODE_START = 0
- Git blob access code starting position
- CURL_TIMEOUT = 5
- A cURL timeout parameter
- CURL_TRANSFER = 1
- A cURL transfer parameter
- GIT_BASE_END_LETTER = 1
- A fixed indicator used to get last letter of git base url
- GIT_BASE_URL_END = '###'
- An indicator to tell ending position of Git url to be used
- GIT_BASE_URL_END_POSITION = -1
- A fixed indicator used to get last letter of git base url
- GIT_BASE_URL_START = 0
- An indicator to tell starting position of Git url to be used
- GIT_BLOB_INDICATOR = '100'
- An indicator that a git file is a blob file
- GIT_BLOB_NEXT = 7
- An indicator of the next position after the access code in a Git blob object
- GIT_BLOB_OBJECT = "blob"
- A fixed indicator used to indicate Git blob object
- GIT_FILE_NAME_END = 38
- A fixed indicator used to mark ending position of SHA hash used to indicate Git object file
- GIT_FILE_NAME_START = 2
- A fixed indicator used to mark starting position of SHA hash used to indicate Git object file
- GIT_FOLDER_NAME_END = 2
- A fixed indicator used to mark ending position of SHA hash used to indicate Git object folder
- GIT_FOLDER_NAME_START = 0
- A fixed indicator used to mark starting position of SHA hash used to indicate Git object folder
- GIT_MASTER_TREE_HASH_END = 41
- A fixed indicator used to mark ending position of SHA hash of Git master tree
- GIT_MASTER_TREE_HASH_START = 16
- A fixed indicator used to mark starting position of SHA hash of Git master tree
- GIT_NAME_START = 0
- An indicator for the start of a Git file or folder name
- GIT_NEXT_URL_END = 40
- A fixed position used to indicate ending position to fetch next Git url from the master file
- GIT_NEXT_URL_START = 0
- A fixed position used to indicate starting point to fetch next Git url from the master file
- GIT_TREE_INDICATOR = '400'
- An indicator that a git file is a tree file
- GIT_TREE_NEXT = 6
- An indicator of the next position after the access code in a Git tree object
- GIT_TREE_OBJECT = "tree"
- A fixed indicator used to indicate Git tree object
- GIT_URL_CONTINUE = '@@@@'
- An indicator to tell more git urls need to be fetched
- GIT_URL_EXTENSION = 'info/refs?service=git-upload-pack'
- A fixed component to be used with Git base url to form Git first url
- GIT_URL_OBJECT = 'objects/'
- A fixed component to be used with Git urls to get next Git urls
- GIT_URL_SPLIT = '/'
- A fixed indicator used to make desired Git folder structure from SHA hash
- HEX_NULL_CHARACTER = "\x00"
- The null character, written as a hexadecimal escape
- INDICATOR_GIT = 'git'
- An indicator to indicate git repository
- INDICATOR_NONE = 'none'
- An indicator to tell no actions to be taken
- SHA_HASH_BINARY_END = 20
- Git SHA hash binary ending position
- SHA_HASH_BINARY_START = 0
- Git SHA hash binary starting position
- TREE_ACCESS_CODE_END = 5
- Git tree access code ending position
- TREE_ACCESS_CODE_START = 0
- Git tree access code starting position
- $all_git_urls : array<string|int, mixed>
- An array used to store all the Git internal urls
- $repository_types : array<string|int, mixed>
- A list of the repository types that can be recognized (git, svn, cvs and so on)
- checkForRepository() : string
- Checks repository type based on extension
- checkNestedStructure() : string
- Checks the nested structure inside git tree object
- checkPosition() : array<string|int, mixed>
- Checks the position of the access code for null values
- fetchGitRepositoryUrl() : array
- Get the Git internal urls from the parent Git url
- getGitData() : string
- Makes the cURL call to get the contents
- getGitMasterFile() : string
- Get the Git second url which points to Git master tree structure
- getGitMasterTree() : string
- Get the Git third url which contains the information about the organization of entire git repository
- getNextGitUrl() : string
- Get the Git content from url which will be used to get the next git url
- getObjects() : array<string|int, mixed>
- Get the Git blob and tree objects
- readBlobSha() : array<string|int, mixed>
- Get the details of the blob file, i.e., blob file name, sha hash and content
- readTreeSha() : array<string|int, mixed>
- Get the details of the tree file, i.e., folder name, sha hash and blob url inside the tree
- setGitRepositoryUrl() : array<string|int, mixed>
- Sets up the seed sites with urls from a git repository (updates these sites if downloading from the repository has already started)
- urlMaker() : string
- Makes the git clone internal url for blob objects
Constants
BLOB_ACCESS_CODE_END
Git blob access code ending position
public mixed BLOB_ACCESS_CODE_END = 6
BLOB_ACCESS_CODE_START
Git blob access code starting position
public mixed BLOB_ACCESS_CODE_START = 0
CURL_TIMEOUT
A cURL timeout parameter
public mixed CURL_TIMEOUT = 5
CURL_TRANSFER
A cURL transfer parameter
public mixed CURL_TRANSFER = 1
GIT_BASE_END_LETTER
A fixed indicator used to get last letter of git base url
public mixed GIT_BASE_END_LETTER = 1
GIT_BASE_URL_END
An indicator to tell ending position of Git url to be used
public mixed GIT_BASE_URL_END = '###'
GIT_BASE_URL_END_POSITION
A fixed indicator used to get last letter of git base url
public mixed GIT_BASE_URL_END_POSITION = -1
GIT_BASE_URL_START
An indicator to tell starting position of Git url to be used
public mixed GIT_BASE_URL_START = 0
GIT_BLOB_INDICATOR
An indicator that a git file is a blob file
public mixed GIT_BLOB_INDICATOR = '100'
GIT_BLOB_NEXT
An indicator of the next position after the access code in a Git blob object
public mixed GIT_BLOB_NEXT = 7
GIT_BLOB_OBJECT
A fixed indicator used to indicate Git blob object
public mixed GIT_BLOB_OBJECT = "blob"
GIT_FILE_NAME_END
A fixed indicator used to mark ending position of SHA hash used to indicate Git object file
public mixed GIT_FILE_NAME_END = 38
GIT_FILE_NAME_START
A fixed indicator used to mark starting position of SHA hash used to indicate Git object file
public mixed GIT_FILE_NAME_START = 2
GIT_FOLDER_NAME_END
A fixed indicator used to mark ending position of SHA hash used to indicate Git object folder
public mixed GIT_FOLDER_NAME_END = 2
GIT_FOLDER_NAME_START
A fixed indicator used to mark starting position of SHA hash used to indicate Git object folder
public mixed GIT_FOLDER_NAME_START = 0
GIT_MASTER_TREE_HASH_END
A fixed indicator used to mark ending position of SHA hash of Git master tree
public mixed GIT_MASTER_TREE_HASH_END = 41
GIT_MASTER_TREE_HASH_START
A fixed indicator used to mark starting position of SHA hash of Git master tree
public mixed GIT_MASTER_TREE_HASH_START = 16
GIT_NAME_START
An indicator for the start of a Git file or folder name
public mixed GIT_NAME_START = 0
GIT_NEXT_URL_END
A fixed position used to indicate ending position to fetch next Git url from the master file
public mixed GIT_NEXT_URL_END = 40
GIT_NEXT_URL_START
A fixed position used to indicate starting point to fetch next Git url from the master file
public mixed GIT_NEXT_URL_START = 0
GIT_TREE_INDICATOR
An indicator that a git file is a tree file
public mixed GIT_TREE_INDICATOR = '400'
GIT_TREE_NEXT
An indicator of the next position after the access code in a Git tree object
public mixed GIT_TREE_NEXT = 6
GIT_TREE_OBJECT
A fixed indicator used to indicate Git tree object
public mixed GIT_TREE_OBJECT = "tree"
GIT_URL_CONTINUE
An indicator to tell more git urls need to be fetched
public mixed GIT_URL_CONTINUE = '@@@@'
GIT_URL_EXTENSION
A fixed component to be used with Git base url to form Git first url
public mixed GIT_URL_EXTENSION = 'info/refs?service=git-upload-pack'
GIT_URL_OBJECT
A fixed component to be used with Git urls to get next Git urls
public mixed GIT_URL_OBJECT = 'objects/'
GIT_URL_SPLIT
A fixed indicator used to make desired Git folder structure from SHA hash
public mixed GIT_URL_SPLIT = '/'
HEX_NULL_CHARACTER
The null character, written as a hexadecimal escape
public mixed HEX_NULL_CHARACTER = "\x00"
INDICATOR_GIT
An indicator to indicate git repository
public mixed INDICATOR_GIT = 'git'
INDICATOR_NONE
An indicator to tell no actions to be taken
public mixed INDICATOR_NONE = 'none'
SHA_HASH_BINARY_END
Git SHA hash binary ending position
public mixed SHA_HASH_BINARY_END = 20
SHA_HASH_BINARY_START
Git SHA hash binary starting position
public mixed SHA_HASH_BINARY_START = 0
TREE_ACCESS_CODE_END
Git tree access code ending position
public mixed TREE_ACCESS_CODE_END = 5
TREE_ACCESS_CODE_START
Git tree access code starting position
public mixed TREE_ACCESS_CODE_START = 0
Properties
$all_git_urls
An array used to store all the Git internal urls
public static array<string|int, mixed> $all_git_urls
$repository_types
A list of the repository types that can be recognized (git, svn, cvs and so on)
public static array<string|int, mixed> $repository_types = ['git' => 'git', 'svn' => 'svn', 'cvs' => 'cvs', 'vss' => 'vss', 'mercurial' => 'mercurial', 'monotone' => 'monotone', 'bazaar' => 'bazaar', 'darcs' => 'darcs', 'arch' => 'arch']
Methods
checkForRepository()
Checks repository type based on extension
public static checkForRepository(string $extension) : string
Parameters
- $extension : string
  url extension to check
Return values
string: $repository_type, the repository type based on the extension of the url
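The lookup described above can be pictured as a plain array access with a fallback. This is a minimal sketch, not the class's actual code: the function name and the `_SKETCH` constants are hypothetical, the array copies the documented `$repository_types` value, and returning `INDICATOR_NONE` for an unknown extension is an assumption.

```php
<?php
// Copy of the documented $repository_types value (hypothetical constant name).
const REPOSITORY_TYPES_SKETCH = [
    'git' => 'git', 'svn' => 'svn', 'cvs' => 'cvs', 'vss' => 'vss',
    'mercurial' => 'mercurial', 'monotone' => 'monotone',
    'bazaar' => 'bazaar', 'darcs' => 'darcs', 'arch' => 'arch',
];
// Copy of the documented INDICATOR_NONE value.
const INDICATOR_NONE_SKETCH = 'none';

/**
 * Maps a url extension to a repository type.
 * Assumption: unknown extensions fall back to INDICATOR_NONE.
 */
function checkForRepositorySketch(string $extension): string
{
    return REPOSITORY_TYPES_SKETCH[$extension] ?? INDICATOR_NONE_SKETCH;
}
```

Under these assumptions, `checkForRepositorySketch('git')` yields `'git'` and an extension such as `'html'` yields `'none'`.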
checkNestedStructure()
Checks the nested structure inside git tree object
public static checkNestedStructure(string $sha_hash, string $git_base_url) : string
Parameters
- $sha_hash : string
  sha of the git tree object
- $git_base_url : string
  common portion of the parent git url
Return values
string: $blob_url, the url of the blob file inside the folder
checkPosition()
Checks the position of the access code for null values
public static checkPosition(string $git_blob_position, string $git_tree_position, string $git_object_content) : array<string|int, mixed>
Parameters
- $git_blob_position : string
  first occurrence of the git blob access code
- $git_tree_position : string
  first occurrence of the git tree access code
- $git_object_content : string
  compressed content of the git master tree
Return values
array<string|int, mixed>: $git_object_positions, the length of the compressed content after the access code
fetchGitRepositoryUrl()
Get the Git internal urls from the parent Git url
public static fetchGitRepositoryUrl(string $url_to_check) : array
Parameters
- $url_to_check : string
  url that needs to be processed
Return values
array: $git_next_urls, the list of Git internal urls that are called during a git clone
getGitData()
Makes the cURL call to get the contents
public static getGitData(string $git_url) : string
Parameters
- $git_url : string
  url to download the contents from
Return values
string: $git_content, the actual content of the git url
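As an illustration of the kind of cURL call described here, the sketch below fetches a url with PHP's cURL extension. It is not the class's implementation. The function names and `SKETCH_` constants are hypothetical, and the mapping of `CURL_TIMEOUT` to `CURLOPT_TIMEOUT` and of `CURL_TRANSFER` to `CURLOPT_RETURNTRANSFER` is an assumption based only on the constant names and values.

```php
<?php
// Values copied from the documented CURL_TIMEOUT and CURL_TRANSFER constants.
const SKETCH_CURL_TIMEOUT = 5;
const SKETCH_CURL_TRANSFER = 1;

/**
 * Builds the cURL option array (hypothetical helper, kept separate so the
 * options can be inspected without touching the network).
 */
function gitCurlOptionsSketch(string $git_url): array
{
    return [
        CURLOPT_URL => $git_url,
        // Assumption: CURL_TRANSFER is passed as CURLOPT_RETURNTRANSFER,
        // so curl_exec() returns the body instead of printing it.
        CURLOPT_RETURNTRANSFER => SKETCH_CURL_TRANSFER,
        // Assumption: CURL_TIMEOUT is passed as CURLOPT_TIMEOUT (seconds).
        CURLOPT_TIMEOUT => SKETCH_CURL_TIMEOUT,
    ];
}

/**
 * Downloads the content behind $git_url and returns it as a string.
 */
function getGitDataSketch(string $git_url): string
{
    $handle = curl_init();
    curl_setopt_array($handle, gitCurlOptionsSketch($git_url));
    $git_content = curl_exec($handle);
    // curl_exec() returns false on failure; normalize that to an empty string.
    return is_string($git_content) ? $git_content : '';
}
```

Returning an empty string on failure is a choice made for this sketch; the documented method only states that it returns the content as a string.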
getGitMasterFile()
Get the Git second url which points to Git master tree structure
public static getGitMasterFile(string $git_first_url_content, string $git_base_url) : string
Parameters
- $git_first_url_content : string
  contents of the Git first url
- $git_base_url : string
  common portion of Git urls
Return values
string: $git_next_url, the second internal Git url
getGitMasterTree()
Get the Git third url which contains the information about the organization of entire git repository
public static getGitMasterTree(string $git_second_url_content, string $git_base_url) : string
Parameters
- $git_second_url_content : string
  contents of the Git second url
- $git_base_url : string
  common portion of git urls
Return values
string: $git_next_url, the third internal git url
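For background, an uncompressed Git commit object starts with the header `commit <size>` and a null byte, followed by a line `tree <40 hex characters>`. When the size has three digits, the tree hash begins at byte offset 16, which agrees with the documented `GIT_MASTER_TREE_HASH_START` value, although that connection is an inference. The sketch below reads the tree hash by locating the null byte instead of relying on a fixed offset; the function name is hypothetical and this is not the class's own code.

```php
<?php
/**
 * Extracts the root tree hash from an uncompressed Git commit object.
 * Returns null when the input does not look like a commit object.
 */
function treeHashFromCommitSketch(string $commit_object): ?string
{
    // The object header ("commit <size>") ends at the first null byte.
    $null_position = strpos($commit_object, "\x00");
    if ($null_position === false) {
        return null;
    }
    $body = substr($commit_object, $null_position + 1);
    // The first line of a commit body names its tree.
    if (strncmp($body, 'tree ', 5) !== 0) {
        return null;
    }
    $hash = substr($body, 5, 40);
    return preg_match('/^[0-9a-f]{40}$/', $hash) === 1 ? $hash : null;
}
```

Locating the null byte keeps the sketch valid for commit objects whose size is not exactly three digits long.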
getNextGitUrl()
Get the Git content from url which will be used to get the next git url
public static getNextGitUrl(string $git_url, string $compression_indicator) : string
Parameters
- $git_url : string
  git url to extract contents from
- $compression_indicator : string
  indicator for compressed or uncompressed contents
Return values
string: $git_object_content, the contents extracted from the url
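Git loose objects are stored zlib-compressed, so content fetched from an `objects/` url generally has to be inflated before it can be read. The sketch below shows that step with PHP's `gzuncompress()`. The function name and the boolean flag are hypothetical stand-ins; the documented `$compression_indicator` is a string whose accepted values are not given here.

```php
<?php
/**
 * Returns readable content for a downloaded Git resource.
 * Assumption: the caller knows whether the content is zlib-compressed.
 */
function decodeGitContentSketch(string $raw_content, bool $is_compressed): string
{
    if (!$is_compressed) {
        return $raw_content;
    }
    // gzuncompress() expects zlib-format data and returns false on failure.
    $inflated = @gzuncompress($raw_content);
    if ($inflated === false) {
        throw new RuntimeException('Content is not valid zlib data');
    }
    return $inflated;
}
```

Throwing on invalid data is a design choice for this sketch, so that a failed download is not mistaken for an empty object.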
getObjects()
Get the Git blob and tree objects
public static getObjects(string $git_object_content, string $git_base_url) : array<string|int, mixed>
Parameters
- $git_object_content : string
  compressed content of the git master tree file
- $git_base_url : string
  common content of the git url
Return values
array<string|int, mixed>: $blob_url, information and urls for git blob objects
readBlobSha()
Get the details of the blob file, i.e., blob file name, sha hash and content
public static readBlobSha(string $git_object_content, string $blob_position, string $length, string $git_base_url) : array<string|int, mixed>
Parameters
- $git_object_content : string
  compressed content of the git master tree
- $blob_position : string
  first occurrence of the git blob access code in $git_object_content
- $length : string
  length of the compressed content of the git master tree
- $git_base_url : string
  common portion of the git url
Return values
array<string|int, mixed>: $git_blob_content, details of the git blob object
readTreeSha()
Get the details of the tree file, i.e., folder name, sha hash and blob url inside the tree
public static readTreeSha(string $git_object_content, string $tree_position, string $length, string $git_base_url) : array<string|int, mixed>
Parameters
- $git_object_content : string
  compressed content of the git master tree
- $tree_position : string
  first occurrence of the git tree access code in $git_object_content
- $length : string
  length of the compressed content of the git master tree
- $git_base_url : string
  common portion of the git url
Return values
array<string|int, mixed>: $git_tree_content, details of the git tree object
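For orientation, each entry in an uncompressed Git tree object consists of an access mode (for example `100644` for a file or `40000` for a folder), a space, the entry name, a null byte, and a 20-byte binary SHA hash. The sketch below walks such entries and is meant only to illustrate the format that `readBlobSha()` and `readTreeSha()` deal with, not to reproduce them. The function name is hypothetical, and the input is assumed to be already decompressed with the `tree <size>` header removed.

```php
<?php
// Values copied from the documented constants of the same names.
const GIT_BLOB_INDICATOR = '100';
const GIT_TREE_INDICATOR = '400';
const HEX_NULL_CHARACTER = "\x00";
const SHA_HASH_BINARY_END = 20;

/**
 * Splits the body of a Git tree object into its entries.
 * Each entry is returned as ['type' => ..., 'name' => ..., 'sha' => ...].
 */
function parseTreeEntriesSketch(string $tree_body): array
{
    $entries = [];
    $offset = 0;
    $length = strlen($tree_body);
    while ($offset < $length) {
        // The access mode runs up to the first space.
        $space = strpos($tree_body, ' ', $offset);
        if ($space === false) {
            break;
        }
        // The entry name runs up to the next null byte.
        $null = strpos($tree_body, HEX_NULL_CHARACTER, $space + 1);
        if ($null === false) {
            break;
        }
        $binary_sha = (string) substr($tree_body, $null + 1, SHA_HASH_BINARY_END);
        if (strlen($binary_sha) < SHA_HASH_BINARY_END) {
            break; // truncated entry
        }
        $mode = substr($tree_body, $offset, $space - $offset);
        if (strncmp($mode, GIT_TREE_INDICATOR, 3) === 0) {
            $type = 'tree';
        } elseif (strncmp($mode, GIT_BLOB_INDICATOR, 3) === 0) {
            $type = 'blob';
        } else {
            $type = 'other'; // e.g. symlinks or submodules
        }
        $entries[] = [
            'type' => $type,
            'name' => substr($tree_body, $space + 1, $null - $space - 1),
            // The hash is stored in binary; convert it to 40 hex characters.
            'sha' => bin2hex($binary_sha),
        ];
        $offset = $null + 1 + SHA_HASH_BINARY_END;
    }
    return $entries;
}
```

The mode lengths in this format (six characters for a file, five for a folder) are consistent with the documented `BLOB_ACCESS_CODE_END` and `TREE_ACCESS_CODE_END` values.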
setGitRepositoryUrl()
Sets up the seed sites with urls from a git repository (updates these sites if downloading from the repository has already started)
public static setGitRepositoryUrl(string $url_to_check, int $counter, array<string|int, mixed> $seeds, array<string|int, mixed> $repository_indicator, array<string|int, mixed> $site_value, int $total_git_urls, array<string|int, mixed> $all_git_urls) : array<string|int, mixed>
Parameters
- $url_to_check : string
  url that needs to be processed
- $counter : int
  keeps track of the number of urls processed
- $seeds : array<string|int, mixed>
  stores sites that are ready to be downloaded
- $repository_indicator : array<string|int, mixed>
  indicates the type of the repository
- $site_value : array<string|int, mixed>
  contains the original Git url crawled
- $total_git_urls : int
  number of urls in the repository less those already processed
- $all_git_urls : array<string|int, mixed>
  current list of urls from the git repository
Return values
array<string|int, mixed>: $git_internal_urls, all the internal Git urls fetched from the parent Git url
urlMaker()
Makes the git clone internal url for blob objects
public static urlMaker(string $sha_hash, string $git_base_url) : string
Parameters
- $sha_hash : string
  sha hash of the git blob object
- $git_base_url : string
  common portion of the git url
Return values
string: $git_object_url, the complete url of the blob file
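Git stores a loose object under `objects/`, in a folder named after the first two characters of its SHA hash and a file named after the remaining 38. The sketch below builds such a url from the documented constant values. The function name is hypothetical, and the base url is assumed to end with a slash; the class's own handling of the base url may differ.

```php
<?php
// Values copied from the documented constants of the same names.
const GIT_URL_OBJECT = 'objects/';
const GIT_URL_SPLIT = '/';
const GIT_FOLDER_NAME_START = 0;
const GIT_FOLDER_NAME_END = 2;
const GIT_FILE_NAME_START = 2;
const GIT_FILE_NAME_END = 38;

/**
 * Builds the url of a loose Git object from its 40-character SHA hash.
 * Assumption: $git_base_url already ends with "/".
 */
function urlMakerSketch(string $sha_hash, string $git_base_url): string
{
    // First two characters of the hash form the folder name.
    $folder = substr($sha_hash, GIT_FOLDER_NAME_START, GIT_FOLDER_NAME_END);
    // The remaining 38 characters form the file name.
    $file = substr($sha_hash, GIT_FILE_NAME_START, GIT_FILE_NAME_END);
    return $git_base_url . GIT_URL_OBJECT . $folder . GIT_URL_SPLIT . $file;
}
```

Here the `_END` constants are used as lengths passed to `substr()`, which is an assumption that happens to fit the values 2 and 38.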