AddressesPlugin
extends IndexingPlugin
in package
implements
CrawlConstants
Used to extract emails, phone numbers, and addresses from a web page.
These are extracted into the EMAILS, PHONE_NUMBERS, and ADDRESSES fields of the page's summary.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $countries : array<string|int, mixed>
- Associative array mapping world country names to country codes. Some entries are duplicated in the country's local script
- $db : object
- Reference to a database object that might be used by models on this plugin
- $index_archive : object
- The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
- $regions : array<string|int, mixed>
- List of common regions, abbreviations, and local spellings of regions of the US, Canada, Australia, UK, as well as major cities elsewhere
- __construct() : mixed
- Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object
- checkCandidate() : mixed
- Checks if the passed sequence of lines has enough features of a postal address to call it an address. If so, return the address as a single string
- checkCountry() : bool
- Used to check if a line contains a word associated with a World country or country code.
- checkPhoneOrEmail() : bool
- Used to check if a line contains either an email address or a phone number
- checkRegion() : bool
- Used to check if a line contains a word associated with a province, state or major city.
- checkStreet() : bool
- Used to check if a given line in an address candidate has features associated with being a street address.
- checkZipPostalCodeWords() : bool
- Used to check if a line contains a word associated with a ZIP or Postal code
- getAdditionalMetaWords() : array<string|int, mixed>
- Returns an array of additional meta words which have been added by this plugin
- getProcessors() : array<string|int, mixed>
- Which mime type page processors this plugin should do additional processing for
- pageProcessing() : array<string|int, mixed>
- This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
- pageSummaryProcessing() : mixed
- Adjusts the document summary of a page after the page processor's process method has been called so that the subdoc's fields associated with the addresses plugin get copied as fields of the whole page summary. Then it deletes the subdoc fields.
- parseEmails() : string
- Extracts substrings from the provided $line that are in the format of an email address. Returns first email from line
- parsePhones() : array<string|int, mixed>
- Checks for a phone number related keyword in the line and if found extracts digits which are presumed to be a phone number
- parseSubdoc() : array<string|int, mixed>
- Parses EMAILS, PHONE_NUMBERS and ADDRESSES from $text and returns an array with these three fields containing sub-arrays of the given items
- postProcessing() : mixed
- This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
Properties
$countries
Associative array mapping world country names to country codes. Some entries are duplicated in the country's local script
public
array<string|int, mixed>
$countries
= ["ANDORRA" => "AD", "UNITED ARAB EMIRATES" => "AE", "AFGHANISTAN" => "AF", "ANTIGUA AND BARBUDA" => "AG", "ANGUILLA" => "AI", "ALBANIA" => "AL", "ARMENIA" => "AM", "ANGOLA" => "AO", "ANTARCTICA" => "AQ", "ARGENTINA" => "AR", "AMERICAN SAMOA" => "AS", "AUSTRIA" => "AT", "AUSTRALIA" => "AU", "ARUBA" => "AW", "ÅLAND ISLANDS" => "AX", "AZERBAIJAN" => "AZ", "BOSNIA AND HERZEGOVINA" => "BA", "BARBADOS" => "BB", "BANGLADESH" => "BD", "BELGIUM" => "BE", "BURKINA FASO" => "BF", "BULGARIA" => "BG", "BAHRAIN" => "BH", "BURUNDI" => "BI", "BENIN" => "BJ", "SAINT BARTHELEMY" => "BL", "BERMUDA" => "BM", "BRUNEI DARUSSALAM" => "BN", "BOLIVIA" => "BO", "BONAIRE, SINT EUSTATIUS AND SABA" => "BQ", "BRAZIL" => "BR", "BAHAMAS" => "BS", "BHUTAN" => "BT", "BOUVET ISLAND" => "BV", "BOTSWANA" => "BW", "BELARUS" => "BY", "BELIZE" => "BZ", "CANADA" => "CA", "COCOS ISLANDS" => "CC", "DEMOCRATIC REPUBLIC OF THE CONGO" => "CD", "CENTRAL AFRICAN REPUBLIC" => "CF", "CONGO" => "CG", "SWITZERLAND" => "CH", "COTE D'IVOIRE" => "CI", "COOK ISLANDS" => "CK", "CHILE" => "CL", "CAMEROON" => "CM", "CHINA" => "CN", "中国" => "China", "COLOMBIA" => "CO", "COSTA RICA" => "CR", "CUBA" => "CU", "CAPE VERDE" => "CV", "CURACAO" => "CW", "CHRISTMAS ISLAND" => "CX", "CYPRUS" => "CY", "CZECH REPUBLIC" => "CZ", "GERMANY" => "DE", "DJIBOUTI" => "DJ", "DENMARK" => "DK", "DOMINICA" => "DM", "DOMINICAN REPUBLIC" => "DO", "ALGERIA" => "DZ", "ECUADOR" => "EC", "ESTONIA" => "EE", "EGYPT" => "EG", "WESTERN SAHARA" => "EH", "ERITREA" => "ER", "SPAIN" => "ES", "ETHIOPIA" => "ET", "FINLAND" => "FI", "FIJI" => "FJ", "FALKLAND ISLANDS (MALVINAS)" => "FK", "MICRONESIA, FEDERATED STATES OF" => "FM", "FAROE ISLANDS" => "FO", "FRANCE" => "FR", "GABON" => "GA", "UNITED KINGDOM" => "GB", "GRENADA" => "GD", "GEORGIA" => "GE", "FRENCH GUIANA" => "GF", "GUERNSEY" => "GG", "GHANA" => "GH", "GIBRALTAR" => "GI", "GREENLAND" => "GL", "GAMBIA" => "GM", "GUINEA" => "GN", "GUADELOUPE" => "GP", "EQUATORIAL GUINEA" => "GQ", "GREECE" => "GR", 
"SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS" => "GS", "GUATEMALA" => "GT", "GUAM" => "GU", "GUINEA-BISSAU" => "GW", "GUYANA" => "GY", "HONG KONG" => "HK", "HEARD ISLAND AND MCDONALD ISLANDS" => "HM", "HONDURAS" => "HN", "CROATIA" => "HR", "HAITI" => "HT", "HUNGARY" => "HU", "INDONESIA" => "ID", "IRELAND" => "IE", "ISRAEL" => "IL", "ISLE OF MAN" => "IM", "INDIA" => "IN", "BRITISH INDIAN OCEAN TERRITORY" => "IO", "IRAQ" => "IQ", "IRAN" => "IR", "ICELAND" => "IS", "ITALY" => "IT", "JERSEY" => "JE", "JAMAICA" => "JM", "JORDAN" => "JO", "JAPAN" => "JP", "日本" => "JA", "KENYA" => "KE", "KYRGYZSTAN" => "KG", "CAMBODIA" => "KH", "KIRIBATI" => "KI", "COMOROS" => "KM", "SAINT KITTS AND NEVIS" => "KN", "NORTH KOREA" => "KP", "SOUTH KOREA" => "KR", "한국" => "KR", "KUWAIT" => "KW", "CAYMAN ISLANDS" => "KY", "KAZAKHSTAN" => "KZ", "LAOS" => "LA", "LEBANON" => "LB", "SAINT LUCIA" => "LC", "LIECHTENSTEIN" => "LI", "SRI LANKA" => "LK", "LIBERIA" => "LR", "LESOTHO" => "LS", "LITHUANIA" => "LT", "LUXEMBOURG" => "LU", "LATVIA" => "LV", "LIBYA" => "LY", "MOROCCO" => "MA", "MONACO" => "MC", "MOLDOVA, REPUBLIC OF" => "MD", "MONTENEGRO" => "ME", "SAINT MARTIN" => "MF", "MADAGASCAR" => "MG", "MARSHALL ISLANDS" => "MH", "MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF" => "MK", "MALI" => "ML", "MYANMAR" => "MM", "MONGOLIA" => "MN", "MACAO" => "MO", "NORTHERN MARIANA ISLANDS" => "MP", "MARTINIQUE" => "MQ", "MAURITANIA" => "MR", "MONTSERRAT" => "MS", "MALTA" => "MT", "MAURITIUS" => "MU", "MALDIVES" => "MV", "MALAWI" => "MW", "MEXICO" => "MX", "MALAYSIA" => "MY", "MOZAMBIQUE" => "MZ", "NAMIBIA" => "NA", "NEW CALEDONIA" => "NC", "NIGER" => "NE", "NORFOLK ISLAND" => "NF", "NIGERIA" => "NG", "NICARAGUA" => "NI", "NETHERLANDS" => "NL", "NORWAY" => "NO", "NEPAL" => "NP", "NAURU" => "NR", "NIUE" => "NU", "NEW ZEALAND" => "NZ", "OMAN" => "OM", "PANAMA" => "PA", "PERU" => "PE", "FRENCH POLYNESIA" => "PF", "PAPUA NEW GUINEA" => "PG", "PHILIPPINES" => "PH", "PAKISTAN" => "PK", "POLAND" => "PL", "SAINT PIERRE 
AND MIQUELON" => "PM", "PITCAIRN" => "PN", "PUERTO RICO" => "PR", "PALESTINE, STATE OF" => "PS", "PORTUGAL" => "PT", "PALAU" => "PW", "PARAGUAY" => "PY", "QATAR" => "QA", "REUNION" => "RE", "ROMANIA" => "RO", "SERBIA" => "RS", "RUSSIA" => "RU", "Россия" => "RU", "RWANDA" => "RW", "SAUDI ARABIA" => "SA", "SOLOMON ISLANDS" => "SB", "SEYCHELLES" => "SC", "SUDAN" => "SD", "SWEDEN" => "SE", "SINGAPORE" => "SG", "SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA" => "SH", "SLOVENIA" => "SI", "SVALBARD AND JAN MAYEN" => "SJ", "SLOVAKIA" => "SK", "SIERRA LEONE" => "SL", "SAN MARINO" => "SM", "SENEGAL" => "SN", "SOMALIA" => "SO", "SURINAME" => "SR", "SOUTH SUDAN" => "SS", "SAO TOME AND PRINCIPE" => "ST", "EL SALVADOR" => "SV", "SINT MAARTEN" => "SX", "SYRIAN ARAB REPUBLIC" => "SY", "SWAZILAND" => "SZ", "TURKS AND CAICOS ISLANDS" => "TC", "CHAD" => "TD", "FRENCH SOUTHERN TERRITORIES" => "TF", "TOGO" => "TG", "THAILAND" => "TH", "TAJIKISTAN" => "TJ", "TOKELAU" => "TK", "TIMOR-LESTE" => "TL", "TURKMENISTAN" => "TM", "TUNISIA" => "TN", "TONGA" => "TO", "TURKEY" => "TR", "TRINIDAD AND TOBAGO" => "TT", "TUVALU" => "TV", "TAIWAN" => "TW", "臺灣" => "TW", "TANZANIA, UNITED REPUBLIC OF" => "TZ", "UKRAINE" => "UA", "UGANDA" => "UG", "UNITED STATES MINOR OUTLYING ISLANDS" => "UM", "UNITED STATES OF AMERICA" => "USA", "UNITED STATES" => "US", "URUGUAY" => "UY", "UZBEKISTAN" => "UZ", "VATICAN CITY" => "VA", "SAINT VINCENT AND THE GRENADINES" => "VC", "VENEZUELA, BOLIVARIAN REPUBLIC OF" => "VE", "BRITISH VIRGIN ISLANDS" => "VG", "U.S. VIRGIN ISLANDS" => "VI", "VIETNAM" => "VN", "VANUATU" => "VU", "WALLIS AND FUTUNA" => "WF", "SAMOA" => "WS", "YEMEN" => "YE", "MAYOTTE" => "YT", "SOUTH AFRICA" => "ZA", "ZAMBIA" => "ZM", "ZIMBABWE" => "ZW"]
$db
Reference to a database object that might be used by models on this plugin
public
object
$db
$index_archive
The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
public
object
$index_archive
$regions
List of common regions, abbreviations, and local spellings of regions of the US, Canada, Australia, UK, as well as major cities elsewhere
public
array<string|int, mixed>
$regions
= ["ALABAMA", "AL", "ALASKA", "AK", "ARIZONA", "AZ", "ARKANSAS", "AR", "CALIFORNIA", "CA", "COLORADO", "CO", "CONNECTICUT", "CT", "DELAWARE", "DE", "FLORIDA", "FL", "GEORGIA", "GA", "HAWAII", "HI", "IDAHO", "ID", "ILLINOIS", "IL", "INDIANA", "IN", "IOWA", "IA", "KANSAS", "KS", "KENTUCKY", "KY", "LOUISIANA", "LA", "MAINE", "ME", "MARYLAND", "MD", "MASSACHUSETTS", "MA", "MICHIGAN", "MI", "MINNESOTA", "MN", "MISSISSIPPI", "MS", "MISSOURI", "MO", "MONTANA", "MT", "NEBRASKA", "NE", "NEVADA", "NV", "HAMPSHIRE", "NH", "NEW JERSEY", "NJ", "MEXICO", "NM", "NEW YORK", "NY", "NC", "NORTH DAKOTA", "ND", "OHIO", "OH", "OKLAHOMA", "OK", "OREGON", "OR", "PENNSYLVANIA", "PA", "RHODE", "RI", "CAROLINA", "SC", "DAKOTA", "SD", "TENNESSEE", "TN", "TEXAS", "TX", "UTAH", "UT", "VERMONT", "VT", "VIRGINIA", "VA", "WASHINGTON", "WA", "WV", "WISCONSIN", "WI", "WYOMING", "WY", "SAMOA", "AS", "COLUMBIA", "DC", "MICRONESIA", "FM", "GUAM", "GU", "MARSHALL", "MH", "MARIANA", "MP", "PALAU", "PW", "PUERTO", "RICO", "PR", "VIRGIN", "ISLANDS", "VI", "ALBERTA", "AB", "BRITISH", "COLUMBIA", "BC", "MANITOBA", "MB", "NEW BRUNSWICK", "NB", "NEWFOUNDLAND", "NL", "NORTHWEST", "TERRITORIES", "NT", "NOVA SCOTIA", "NS", "NUNAVUT", "NU", "ONTARIO", "ON", "PRINCE EDWARD ISLAND", "PE", "QUEBEC", "QC", "SASKATCHEWAN", "SK", "YUKON", "YT", "CAPITAL", "ACT", "CHRISTMAS", "CX", "COCOS ISLANDS", "CC", "JERVIS", "BAY", "JBT", "SOUTH", "WALES", "NSW", "NORFOLK", "NF", "NT", "QUEENSLAND", "QLD", "SA", "TASMANIA", "TAS", "VICTORIA", "VIC", "WA", "ABERDEENSHIRE", "ABD", "ANGLESEY", "AGY", "ALDERNEY", "ALD", "ANGUS", "ANS", "ANTRIM", "ANT", "ARGYLLSHIRE", "ARL", "ARMAGH", "ARM", "AVON", "AVN", "AYRSHIRE", "AYR", "BANFFSHIRE", "BAN", "BEDFORDSHIRE", "BDF", "BERWICKSHIRE", "BEW", "BUCKINGHAMSHIRE", "BKM", "BORDERS", "BOR", "BRECONSHIRE", "BRE", "BERKSHIRE", "BRK", "BUTE", "BUT", "CAERNARVONSHIRE", "CAE", "CAITHNESS", "CAI", "CAMBRIDGESHIRE", "CAM", "CARLOW", "CAR", "CAVAN", "CAV", "CENTRAL", "CEN", "CARDIGANSHIRE", "CGN", 
"CHESHIRE", "CHS", "CLARE", "CLA", "CLACKMANNANSHIRE", "CLK", "CLEVELAND", "CLV", "CUMBRIA", "CMA", "CARMARTHENSHIRE", "CMN", "CORNWALL", "CON", "CORK", "COR", "CUMBERLAND", "CUL", "CLWYD", "CWD", "DERBYSHIRE", "DBY", "DENBIGHSHIRE", "DEN", "DEVON", "DEV", "DYFED", "DFD", "DUMFRIES-SHIRE", "DFS", "DUMFRIES", "GALLOWAY", "DGY", "DUNBARTONSHIRE", "DNB", "DONEGAL", "DON", "DORSET", "DOR", "DOWN", "DOW", "DUBLIN", "DUB", "DURHAM", "DUR", "ELN", "ERY", "ESSEX", "ESS", "FERMANAGH", "FER", "FIFE", "FIF", "FLINTSHIRE", "FLN", "GALWAY", "GAL", "GLAMORGAN", "GLA", "GLOUCESTERSHIRE", "GLS", "GRAMPIAN", "GMP", "GWENT", "GNT", "GUERNSEY", "GSY", "MANCHESTER", "GTM", "GWYNEDD", "GWN", "HAMPSHIRE", "HAM", "HEREFORDSHIRE", "HEF", "HIGHLAND", "HLD", "HERTFORDSHIRE", "HRT", "HUMBERSIDE", "HUM", "HUNTINGDONSHIRE", "HUN", "HEREFORD", "WORCESTER", "HWR", "INVERNESS-SHIRE", "INV", "WIGHT", "IOW", "JERSEY", "JSY", "KINCARDINESHIRE", "KCD", "KENT", "KEN", "KERRY", "KER", "KILDARE", "KID", "KILKENNY", "KIK", "KIRKCUDBRIGHTSHIRE", "KKD", "KINROSS-SHIRE", "KRS", "LANCASHIRE", "LAN", "LONDONDERRY", "LDY", "LEICESTERSHIRE", "LEI", "LEITRIM", "LET", "LAOIS", "LEX", "LIMERICK", "LIM", "LINCOLNSHIRE", "LIN", "LANARKSHIRE", "LKS", "LONGFORD", "LOG", "LOUTH", "LOU", "LOTHIAN", "LTN", "MAYO", "MAY", "MEATH", "MEA", "MERIONETHSHIRE", "MER", "GLAMORGAN", "MGM", "MONTGOMERYSHIRE", "MGY", "MIDLOTHIAN", "MLN", "MONAGHAN", "MOG", "MONMOUTHSHIRE", "MON", "MORAYSHIRE", "MOR", "MERSEYSIDE", "MSY", "NAIRN", "NAI", "NORTHUMBERLAND", "NBL", "NORFOLK", "NFK", "NORTH RIDING OF YORKSHIRE", "NRY", "NORTHAMPTONSHIRE", "NTH", "NOTTINGHAMSHIRE", "NTT", "NYK", "OFFALY", "OFF", "ORKNEY", "OKI", "OXFORDSHIRE", "OXF", "PEEBLES-SHIRE", "PEE", "PEMBROKESHIRE", "PEM", "PERTH", "PER", "POWYS", "POW", "RADNORSHIRE", "RAD", "RENFREWSHIRE", "RFW", "ROSS", "CROMARTY", "ROC", "ROSCOMMON", "ROS", "ROXBURGHSHIRE", "ROX", "RUTLAND", "RUT", "SHROPSHIRE", "SAL", "SELKIRKSHIRE", "SEL", "SUFFOLK", "SFK", "GLAMORGAN", "SGM", "SHETLAND", 
"SHI", "SLIGO", "SLI", "SOMERSET", "SOM", "SARK", "SRK", "SURREY", "SRY", "SUSSEX", "SSX", "STRATHCLYDE", "STD", "STIRLINGSHIRE", "STI", "STAFFORDSHIRE", "STS", "SUTHERLAND", "SUT", "SUSSEX", "SXE", "SXW", "SYK", "TAYSIDE", "TAY", "TIPPERARY", "TIP", "TYNE", "TWR", "TYRONE", "TYR", "WARWICKSHIRE", "WAR", "WATERFORD", "WAT", "WESTMEATH", "WEM", "WESTMORLAND", "WES", "WEXFORD", "WEX", "WEST GLAMORGAN", "WGM", "WICKLOW", "WIC", "WIGTOWNSHIRE", "WIG", "WILTSHIRE", "WIL", "ISLES", "WIS", "LOTHIAN", "WLN", "WEST MIDLANDS", "WMD", "WORCESTERSHIRE", "WOR", "WRY", "WEST", "WYK", "YORKSHIRE", "YKS", "HELSINKI", "МОСКВА", "上海", "北京", "南京", "成都", "HONG", "KONG", "TOKYO", "SEOUL", "東京", "香港", "서울", "MADRID", "BARCELONA", "ROME", "PARIS", "MARSEILLE", "TOULOUSE", "LYON", "ORLEAN", "BRUSSELS", "DELHI", "UTRECHT", "COPENHAGEN", "BERLIN", "FRANKFURT", "MÜNCHEN", "MUNICH", "VIENNA", "ISTANBUL", "ΑΘΗΝΑ", "ATHENS", "ПЕТЕРБУРГ", "BUENOS", "AIRES", "RIO", "JANEIRO", "MANILA", "深圳", "CHICAGO", "KARACHI", "BANGKOK", "LAGOS", "JOHANNESBERG", "FRANSCICO", "TORONTO", "MIAMI", "PHILADELPHIA", "KUALA", "LAMPUR", "ESSEN", "LONDON", "KINSHASA", "BOSTON", "AMSTERDAM", "臺北", "武漢", "AHMEDABAD", "BANGALORE", "HYDERABAD", "BAGHDAD", "LIMA", "名古屋", "ANGELES", "SANTIAGO", "MILANO", "HOUSTON", "SHÀNGHAISHÌ", "AP", "ANDHRA", "AR", "ARUNACHAL", "AS", "ASSAM", "BR", "BIHAR", "CT", "CHHATTISGARH", "GA", "GOA", "GJ", "GUJARAT", "HR", "HARYANA", "HP", "HIMACHAL", "JK", "JAMMU", "KASHMIR", "JH", "JHARKHAND", "KA", "KARNATAKA", "KL", "KERALA", "MP", "MADHYA", "MH", "MAHARASHTRA", "MN", "MANIPUR", "ML", "MEGHALAYA", "MZ", "MIZORAM", "NL", "NAGALAND", "OR", "ORISSA", "PB", "PUNJAB", "RJ", "RAJASTHAN", "SK", "SIKKIM", "TN", "TAMIL", "NADU", "TR", "TRIPURA", "UT", "UTTARAKHAND", "UP", "UTTAR", "PRADESH", "WB", "BENGAL", "ANDAMAN", "NICOBAR", "CH", "CHANDIGARH", "DN", "DADRA", "NAGAR", "HAVELI", "DD", "DAMAN", "DIU", "DL", "LD", "LAKSHADWEEP", "PY", "PUDUCHERRY", "PONDICHERRY", "澳門半島"]
Methods
__construct()
Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object
public
__construct() : mixed
Return values
mixed
checkCandidate()
Checks if the passed sequence of lines has enough features of a postal address to call it an address. If so, return the address as a single string
public
checkCandidate(array<string|int, mixed> $pre_address) : mixed
Parameters
- $pre_address : array<string|int, mixed>
- an array of potential address lines
Return values
mixed —false if not address, the lines imploded together using space if an address
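The idea of accepting a candidate once it shows enough address features can be sketched as follows. This is a minimal illustration, not the plugin's actual logic: the function name `sketchCheckCandidate`, the `$checks` callables, and the threshold `$min_features` are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): treats $pre_address as an
 * address when at least $min_features of the supplied feature checks
 * succeed on some line. Returns the lines joined by spaces, or false.
 */
function sketchCheckCandidate(array $pre_address, array $checks,
    int $min_features = 2)
{
    $found = 0;
    foreach ($checks as $check) {
        foreach ($pre_address as $line) {
            if ($check($line)) {
                $found++;
                break; // count each feature at most once
            }
        }
    }
    return ($found >= $min_features) ? implode(" ", $pre_address) : false;
}

// Assumed, deliberately simple feature checks for the demonstration
$checks = [
    "street" => fn(string $l): bool =>
        preg_match('/\b(STREET|AVENUE|WAY)\b/i', $l) === 1,
    "zip" => fn(string $l): bool => preg_match('/\b\d{5}\b/', $l) === 1,
];
var_dump(sketchCheckCandidate(["1 Main Street", "San Jose, CA 95192"], $checks));
```

The threshold of two features is arbitrary here; the real method may weigh street, region, country, and postal-code features differently.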
checkCountry()
Used to check if a line contains a word associated with a World country or country code.
public
checkCountry(string $line) : bool
Parameters
- $line : string
- line from an address to check
Return values
bool —whether it contains a country term
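A check of this kind could look roughly like the sketch below, which tests a line against a small map shaped like $countries. The function name `sketchCheckCountry`, the whole-word matching rule, and the three-entry map are assumptions for illustration; the plugin's own matching may differ.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): returns true if any country
 * name or code from $countries occurs in $line as a whole word or phrase.
 */
function sketchCheckCountry(string $line, array $countries): bool
{
    // strtoupper only needs to handle ASCII here; non-Latin keys such as
    // "日本" have no letter case and are compared as-is.
    $line = strtoupper($line);
    foreach ($countries as $name => $code) {
        foreach ([$name, $code] as $term) {
            // Require that the term is not embedded in a longer word
            $pattern = '/(?<![\p{L}\p{N}])' . preg_quote((string)$term, '/') .
                '(?![\p{L}\p{N}])/u';
            if (preg_match($pattern, $line) === 1) {
                return true;
            }
        }
    }
    return false;
}

// A tiny excerpt in the same shape as the $countries property
$countries = ["CANADA" => "CA", "FRANCE" => "FR", "日本" => "JA"];
var_dump(sketchCheckCountry("Toronto, ON, Canada", $countries));
```

Two-letter codes such as IN or TO are also ordinary English words, so a check like this is best treated as one signal among several rather than proof that a line names a country.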
checkPhoneOrEmail()
Used to check if a line contains either an email address or a phone number
public
checkPhoneOrEmail(string $line) : bool
Parameters
- $line : string
- line from an address to check
Return values
bool —whether it contains an email address or a phone number
checkRegion()
Used to check if a line contains a word associated with a province, state or major city.
public
checkRegion(string $line) : bool
Parameters
- $line : string
- line from an address to check
Return values
bool —whether it contains a region term
checkStreet()
Used to check if a given line in an address candidate has features associated with being a street address.
public
checkStreet(string $line) : bool
Parameters
- $line : string
- address line to check
Return values
bool —whether or not it contains a word identified with being a street address such as WAY, AVENUE, STREET, etc.
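As a rough illustration of a street-word test, the sketch below splits a line into words and looks for any of a few street terms. Only WAY, AVENUE, and STREET come from the description above; the other terms, and the function name `sketchCheckStreet`, are assumptions.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): returns true if $line
 * contains a whole word commonly found in street addresses.
 */
function sketchCheckStreet(string $line): bool
{
    // WAY, AVENUE, STREET are named in the docs; the rest are assumed
    $street_words = ["WAY", "AVENUE", "AVE", "STREET", "ST", "ROAD", "RD"];
    // Split on anything that is not an ASCII letter or digit
    $words = preg_split('/[^A-Z0-9]+/', strtoupper($line), -1,
        PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($words, $street_words)) > 0;
}

var_dump(sketchCheckStreet("123 Main Street"));
```

Matching whole words keeps a name like "Broadway" from triggering on the embedded "WAY".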
checkZipPostalCodeWords()
Used to check if a line contains a word associated with a ZIP or Postal code
public
checkZipPostalCodeWords(string $line) : bool
Parameters
- $line : string
- line from an address to check
Return values
bool —whether it contains such a code
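One way to picture a postal-code check is with a pair of regular expressions, as in the sketch below. It recognizes only US ZIP codes (five digits, optionally ZIP+4) and Canadian postal codes; the function name `sketchCheckZipPostalCode` and the choice of formats are assumptions, and the plugin may recognize other formats or use a different technique.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): returns true if $line
 * contains something shaped like a US ZIP code or a Canadian postal code.
 */
function sketchCheckZipPostalCode(string $line): bool
{
    $patterns = [
        '/\b\d{5}(?:-\d{4})?\b/',            // US ZIP or ZIP+4
        '/\b[A-Z]\d[A-Z][ -]?\d[A-Z]\d\b/i', // Canadian, e.g. K1A 0B1
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $line) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(sketchCheckZipPostalCode("Ottawa, ON K1A 0B1"));
```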
getAdditionalMetaWords()
Returns an array of additional meta words which have been added by this plugin
public
static getAdditionalMetaWords() : array<string|int, mixed>
Return values
array<string|int, mixed> —meta words and maximum description length of results allowed for that meta word
getProcessors()
Which mime type page processors this plugin should do additional processing for
public
static getProcessors() : array<string|int, mixed>
Return values
array<string|int, mixed> —an array of page processors
pageProcessing()
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
public
pageProcessing(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
- web-page contents
- $url : string
- the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —consisting of a sequence of subdoc arrays found on the given page.
pageSummaryProcessing()
Adjusts the document summary of a page after the page processor's process method has been called so that the subdoc's fields associated with the addresses plugin get copied as fields of the whole page summary. Then it deletes the subdoc fields.
public
pageSummaryProcessing(array<string|int, mixed> &$summary, string $url) : mixed
Parameters
- $summary : array<string|int, mixed>
- summary of the current document; it is passed by reference and modified in place
- $url : string
- the url where the summary contents came from
Return values
mixed
parseEmails()
Extracts substrings from the provided $line that are in the format of an email address. Returns first email from line
public
parseEmails(string $line) : string
Parameters
- $line : string
- string to extract email from
Return values
string —first email found on line
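Returning the first email-shaped substring of a line might be sketched as below. The regular expression is a simplified pattern chosen for the example, the name `sketchParseEmails` is hypothetical, and returning an empty string when nothing matches is an assumption, since the documentation does not say what happens in that case.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): returns the first substring
 * of $line that looks like an email address, or "" if there is none.
 */
function sketchParseEmails(string $line): string
{
    // Simplified pattern: local part, "@", then dot-separated domain labels
    $pattern = '/[A-Za-z0-9._%+\-]+@[A-Za-z0-9\-]+(?:\.[A-Za-z0-9\-]+)+/';
    if (preg_match($pattern, $line, $matches) === 1) {
        return $matches[0];
    }
    return ""; // assumed behavior when no email is present
}

var_dump(sketchParseEmails("Contact: alice@example.com or bob@example.org"));
```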
parsePhones()
Checks for a phone number related keyword in the line and if found extracts digits which are presumed to be a phone number
public
parsePhones(string $line) : array<string|int, mixed>
Parameters
- $line : string
- line to check for phone numbers
Return values
array<string|int, mixed> —all phone numbers detected by this method from the $line
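The two-step approach described here (look for a phone keyword, then pull out digits) can be sketched as follows. The keyword list, the 7 to 15 digit length limits, and the name `sketchParsePhones` are assumptions for illustration only.

```php
<?php
/**
 * Hypothetical sketch (not the plugin's code): if $line mentions a phone
 * keyword, returns the digit strings of every phone-like run in the line.
 */
function sketchParsePhones(string $line): array
{
    // Assumed keyword list; the plugin's own list is not documented here
    if (preg_match('/\b(phone|tel|telephone|fax|mobile)\b/i', $line) !== 1) {
        return [];
    }
    // Runs that start and end with a digit and contain only digits,
    // spaces, parentheses, dots, or hyphens in between
    preg_match_all('/\+?\d[\d\s().\-]{5,}\d/', $line, $matches);
    $phones = [];
    foreach ($matches[0] as $candidate) {
        $digits = preg_replace('/\D/', '', $candidate);
        // Assumed plausibility bounds on the number of digits
        if (strlen($digits) >= 7 && strlen($digits) <= 15) {
            $phones[] = $digits;
        }
    }
    return $phones;
}

print_r(sketchParsePhones("Tel: (408) 555-0100, Fax: 408-555-0199"));
```

Requiring a keyword first keeps unrelated digit runs, such as order numbers, from being reported as phone numbers.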
parseSubdoc()
Parses EMAILS, PHONE_NUMBERS and ADDRESSES from $text and returns an array with these three fields containing sub-arrays of the given items
public
parseSubdoc(string $text) : array<string|int, mixed>
Parameters
- $text : string
- text to use for extraction
Return values
array<string|int, mixed> —with found emails, phone numbers, and addresses
postProcessing()
This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
public
postProcessing(string $index_name) : mixed
Parameters
- $index_name : string
- the name/timestamp of an IndexArchiveBundle to do post processing for