Make Aux Url Info a text area, add to wiki documentation, a=chris

Chris Pollett [2019-01-17]

Filename
src/configs/PublicHelpPages.php
src/library/media_jobs/FeedsUpdateJob.php
src/library/media_jobs/WikiMediaJob.php
src/locale/en_US/configure.ini
src/views/elements/SearchsourcesElement.php

diff --git a/src/configs/PublicHelpPages.php b/src/configs/PublicHelpPages.php
index cbed7ed2a..4b7aed4d1 100644
--- a/src/configs/PublicHelpPages.php
+++ b/src/configs/PublicHelpPages.php
@@ -2050,7 +2050,23 @@ Yioop supports the downloading of single video or audio file sources, as well as
 &lt;br /&gt;

 A &#039;&#039;&#039;Scrape podcast source&#039;&#039;&#039; is like a &#039;&#039;&#039;Feed Podcast source&#039;&#039;&#039;, but where one has a HTML or XML page which has a periodically updated link to a video or audio source. For example, it might be an evening news web site.
-The URL field should be the page with the periodically updated link. The &#039;&#039;&#039;Aux Url XPath&#039;&#039;&#039; link, if not blank, should be an xpath on this page to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, &#039;&#039;&#039;Download XPath&#039;&#039;&#039; should be the xpath of the url of the video or audio file to download.
+The URL field should be the page with the periodically updated link. The &#039;&#039;&#039;Aux Url XPaths&#039;&#039;&#039; field, if not blank, should be a sequence of xpaths or regexes, one per line. The first line will be applied to the page to obtain the next url to download. The next line&#039;s xpath or regex is applied to the file downloaded from that url, and so on. The final url generated should point to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, &#039;&#039;&#039;Download XPath&#039;&#039;&#039; should be the xpath of the url of the video or audio file to download.
+If a regex is used rather than an xpath, then the first capture group of the regex should give the url. A regex can be followed by json| to indicate that the first capture group should be converted to a JSON object. The |-separated names after json| then give a path through sub-objects of this object to a url. As an example, consider the following, which at some point could download the Nightly News Scrape Podcast to a wiki group:
+
+ Type: Scrape Podcast
+ Name: Nightly News Podcast
+ URL: https://www.somenetwork.com/nightly-news
+ Language: English
+ Aux Url XPaths:
+ /(https\:\/\/cdn.somenetwork.com\/nightly-news-netcast\/video\/nightly-[^\&quot;]+)\&quot;/
+ /window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl
+ Download Xpath: //video[contains(@height,&#039;540&#039;)]
+ Wiki Destination: My Private Group@Podcasts/%Y-%m-%d.mp4
+
+The initial page to be downloaded will be: https://www.somenetwork.com/nightly-news. On this page, we will use the first Aux Url XPath to find a string in the page that matches /(https\:\/\/cdn.somenetwork.com\/nightly-news-netcast\/video\/nightly-[^\&quot;]+)\&quot;/. The text matched between the parentheses is the first capture group and will be the next url to download. So, for example, one might get a url:
+ https://cdn.somenetwork.com/nightly-news-netcast/video/nightly-safghdsjfg
+This url is then downloaded and a string matching the pattern /window\.\_\_data\s*\=\s*([^\n]+\}\;)/ is found. The capture group portion of this string, which consists of what matches ([^\n]+\}\;), is then converted to a JSON object because of the json| in the Aux Url XPath. From this JSON object, we look at the video field, then its current subfield, then that subfield&#039;s 0 subfield, and finally the publicUrl field. This is the url we download next. Lastly, the Download XPath is used to get the final video link from this downloaded page.
+Once this video is downloaded, it is stored in the Podcasts page&#039;s resource folder of the My Private Group wiki group in a file with a name in the format: %Y-%m-%d.mp4.
 EOD;
 $help_pages["en-US"]["Monetization"] = <<< EOD
 page_type=standard
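
Reviewer's sketch of the regex and json| step described in the help text above. This is not Yioop's code: the function name extractAuxUrl, the example.com url, and the stripping of a trailing ";" before decoding are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): applies one Aux Url line of
 * the form "/regex/" or "/regex/json|key|key|..." to a page string and
 * returns the extracted url, or null if nothing usable is found.
 */
function extractAuxUrl(string $aux_line, string $page): ?string
{
    // Split "/regex/json|a|b|0|c" into the regex and the optional path.
    $parts = explode("json|", trim($aux_line), 2);
    if (!preg_match(trim($parts[0]), $page, $matches) ||
        empty($matches[1])) {
        return null;
    }
    if (count($parts) < 2) {
        // No json| suffix: the first capture group is the url itself.
        return $matches[1];
    }
    // Assumption: a trailing ";" captured after the JSON text has to be
    // dropped before json_decode will accept it.
    $value = json_decode(rtrim(trim($matches[1]), ";"), true);
    // Walk the |-separated keys down through the decoded object.
    foreach (explode("|", $parts[1]) as $key) {
        if (!is_array($value) || !array_key_exists($key, $value)) {
            return null;
        }
        $value = $value[$key];
    }
    return is_string($value) ? $value : null;
}

// Usage with a made-up page in the style of the help text example.
$page = '<script>window.__data = {"video":{"current":' .
    '[{"publicUrl":"https://cdn.example.com/v/1"}]}};' . "\n</script>";
$line = '/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl';
echo extractAuxUrl($line, $page), "\n";
```

Each key after json| selects one level of the decoded object, so video|current|0|publicUrl reads as: the video field, its current subfield, element 0 of that list, then publicUrl.
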
diff --git a/src/library/media_jobs/FeedsUpdateJob.php b/src/library/media_jobs/FeedsUpdateJob.php
index 9dd0789fb..19ec1574f 100644
--- a/src/library/media_jobs/FeedsUpdateJob.php
+++ b/src/library/media_jobs/FeedsUpdateJob.php
@@ -221,8 +221,9 @@ class FeedsUpdateJob extends MediaJob
         $test_results = "";
         $log_function = function ($msg, $log_tag = "pre class='source-test'")
             use (&$test_results, $test_mode) {
+            $close_tag= preg_split("/\s+/",$log_tag)[0];
             if ($test_mode) {
-                $test_results .= "<$log_tag>$msg</$log_tag>\n";
+                $test_results .= "<$log_tag>$msg</$close_tag>\n";
             } else {
                 L\crawlLog($msg);
             }
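
Reviewer's sketch of why the closing tag is now computed separately. The default $log_tag carries an attribute, so reusing it verbatim as the closing tag produced malformed markup. The standalone function wrapInTag below is a hypothetical stand-in for the closure in the diff, not Yioop code.

```php
<?php
/**
 * Hypothetical stand-in for the $log_function closure: wraps a message
 * in a tag whose opening form may carry attributes.
 */
function wrapInTag(string $msg,
    string $log_tag = "pre class='source-test'"): string
{
    // Keep only the tag name (text before the first whitespace) so the
    // closing tag does not repeat the attributes.
    $close_tag = preg_split("/\s+/", $log_tag)[0];
    return "<$log_tag>$msg</$close_tag>\n";
}

echo wrapInTag("fetching feed");
echo wrapInTag("Results", "h3");
```

A tag passed without attributes, such as "h3", is unaffected because preg_split then returns the whole string as its first element.
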
diff --git a/src/library/media_jobs/WikiMediaJob.php b/src/library/media_jobs/WikiMediaJob.php
index 59c3d5684..2610fc83a 100644
--- a/src/library/media_jobs/WikiMediaJob.php
+++ b/src/library/media_jobs/WikiMediaJob.php
@@ -235,8 +235,9 @@ class WikiMediaJob extends MediaJob
         $test_results = "";
         $log_function = function ($msg, $log_tag = "pre class='source-test'")
             use (&$test_results, $test_mode) {
+            $close_tag= preg_split("/\s+/",$log_tag)[0];
             if ($test_mode) {
-                $test_results .= "<$log_tag>$msg</$log_tag>\n";
+                $test_results .= "<$log_tag>$msg</$close_tag>\n";
             } else {
                 L\crawlLog($msg);
             }
@@ -319,8 +320,9 @@ class WikiMediaJob extends MediaJob
         $test_results = "";
         $log_function = function ($msg, $log_tag = "pre class='source-test'")
             use (&$test_results, $test_mode) {
+            $close_tag= preg_split("/\s+/",$log_tag)[0];
             if ($test_mode) {
-                $test_results .= "<$log_tag>$msg</$log_tag>\n";
+                $test_results .= "<$log_tag>$msg</$close_tag>\n";
             } else {
                 L\crawlLog($msg);
             }
@@ -329,7 +331,9 @@ class WikiMediaJob extends MediaJob
         $dom = $this->createDOMDocument($page);
         $source_url = $podcast["SOURCE_URL"];
         if (!empty($podcast['AUX_URL_XPATH'])) {
-            $sub_aux_xpaths = explode("##", $podcast['AUX_URL_XPATH']);
+            $sub_aux_xpaths = explode("\n", $podcast['AUX_URL_XPATH']);
+            $log_function("...Processing the following AUX PATHS:", "h3");
+            $log_function(print_r($sub_aux_xpaths, true));
             foreach ($sub_aux_xpaths as $aux_xpath) {
                 $aux_url = $this->getLinkFromQueryPage($aux_xpath,
                     $page, $dom, $source_url);
@@ -391,7 +395,16 @@ class WikiMediaJob extends MediaJob
         return $dom;
     }
     /**
+     * Used to extract a URL from a page, given either as a string or in
+     * dom form, and to canonicalize it based on a starting url.
      *
+     * @param string $xpath either an xpath to look into a dom object or
+     *      a regex to search a page as a string
+     * @param string $page source page to search in as a string
+     * @param object $dom source page as a dom object
+     * @param string $source_url url to use to canonicalize an incomplete
+     *  url if the extraction only produces part of a url
+     * @return string desired url link
      */
     public function getLinkFromQueryPage($xpath, $page, $dom, $source_url)
     {
@@ -403,7 +416,7 @@ class WikiMediaJob extends MediaJob
         if ($nodes === false) {
             $regex_json_parts = explode("json|", $xpath);
             set_error_handler(null);
-            @preg_match_all($regex_json_parts[0], $page, $matches);
+            @preg_match_all(trim($regex_json_parts[0]), $page, $matches);
             set_error_handler(C\NS_CONFIGS . "yioop_error_handler");
             if (!empty($matches[1][0])) {
                 $url = $matches[1][0];
@@ -464,8 +477,9 @@ class WikiMediaJob extends MediaJob
         $test_results = "";
         $log_function = function ($msg, $log_tag = "pre class='source-test'")
             use (&$test_results, $test_mode) {
+            $close_tag= preg_split("/\s+/",$log_tag)[0];
             if ($test_mode) {
-                $test_results .= "<$log_tag>$msg</$log_tag>\n";
+                $test_results .= "<$log_tag>$msg</$close_tag>\n";
             } else {
                 L\crawlLog($msg);
             }
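
Reviewer's sketch of splitting the new text area value into one path per line. Browsers typically submit text area line breaks as "\r\n", so after explode("\n", ...) each piece can keep a stray "\r"; that is presumably why the regex is now passed through trim(). The helper name splitAuxPaths is hypothetical, and dropping blank lines is an extra step not present in the diff. The arrow function assumes PHP 7.4 or later.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): turns the text area contents
 * into a clean list of xpaths/regexes, one per non-blank line.
 */
function splitAuxPaths(string $aux_info): array
{
    // \R matches any newline sequence, including "\r\n".
    $lines = preg_split('/\R/', $aux_info);
    // Remove surrounding whitespace left over on each line.
    $lines = array_map('trim', $lines);
    // Drop empty lines and reindex from 0.
    return array_values(array_filter($lines, fn($line) => $line !== ""));
}

print_r(splitAuxPaths("/(https:[^\"]+)\"/\r\n//a[@id='next']/@href \n\n"));
```
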
diff --git a/src/locale/en_US/configure.ini b/src/locale/en_US/configure.ini
index 47992fb0c..b5000dafb 100644
--- a/src/locale/en_US/configure.ini
+++ b/src/locale/en_US/configure.ini
@@ -1143,7 +1143,7 @@ searchsources_element_expires = "Expires:"
 searchsources_element_thumbnail = "Thumb:"
 searchsources_element_feed_instruct = "Provide xpaths to feed components below:"
 searchsources_element_regex_instruct = "Enter feed regexes. Regexes except Item separator should have 1 capture group."
-searchsources_element_aux_url_xpath = "Aux Url XPath:"
+searchsources_element_aux_url_xpath = "Aux Url XPaths:"
 searchsources_element_channelpath = "Channel:"
 searchsources_element_item_text = "Item:"
 searchsources_element_item_regex = "Item Separator:"
diff --git a/src/views/elements/SearchsourcesElement.php b/src/views/elements/SearchsourcesElement.php
index 1f7690c17..5406d7d73 100644
--- a/src/views/elements/SearchsourcesElement.php
+++ b/src/views/elements/SearchsourcesElement.php
@@ -179,7 +179,8 @@ class SearchsourcesElement extends Element
             tl('searchsources_element_aux_url_xpath');
             ?></span><span id="channel-text"><?=
             tl('searchsources_element_channelpath') ?></span></b></label>
-            </td><td><input type="text" id="channel-path" name="channel_path"
+            </td><td id='channel-aux'><input type="text"
+                id="channel-path" name="channel_path"
                 value="<?= $data['CURRENT_SOURCE']['channel_path'] ?>"
                 maxlength="<?= $sub_aux_len ?>"
                 class="wide-field" /></td></tr>
@@ -277,7 +278,14 @@ class SearchsourcesElement extends Element
                 <?= $source['LANGUAGE'] ?><br />
                 <b><?=($is_feed) ? tl('searchsources_element_category')
                     : tl('searchsources_element_expires'); ?></b>
-                <?= $source['CATEGORY'] ?><br />
+                <?php
+                    if (in_array($source['TYPE'], ["feed_podcast",
+                        "scrape_podcast"])) {
+                            echo $data['PODCAST_EXPIRES'][$source['CATEGORY']];
+                    } else {
+                        echo $source['CATEGORY'];
+                    }
+                ?><br />
                 <b><?= tl('searchsources_element_url') ?></b>
                 <pre><?= $source['SOURCE_URL']?></pre>
                 <b><?= tl('searchsources_element_aux_info') ?></b><br />
@@ -417,9 +425,22 @@ class SearchsourcesElement extends Element
         </table>
         </div>
         <script>
+        <?php
+        $channel_string = json_encode(
+            html_entity_decode($data['CURRENT_SOURCE']['channel_path']));
+        ?>
         function switchSourceType()
         {
             var stype = elt("source-type");
+            channel_string = <?= $channel_string ?>;
+            channel_inner = '<input type="text" ' +
+                'id="channel-path" name="channel_path" '+
+                'value="' + channel_string + '" ' +
+                'maxlength="<?= $sub_aux_len ?>" ' +
+                'class="wide-field" />';
+            aux_inner = '<textarea class="short-text-area" ' +
+                'id="channel-path" name="channel_path">' +
+                channel_string +'</textarea>';
             stype = stype.options[stype.selectedIndex].value;
             if (stype == "html" || stype == 'json' || stype == 'regex') {
                 setDisplay("thumb-text", false);
@@ -436,6 +457,7 @@ class SearchsourcesElement extends Element
                     setDisplay("instruct", true);
                 }
                 setDisplay("channel-text", true);
+                elt('channel-aux').innerHTML = channel_inner;
                 setDisplay("aux-url-xpath", false);
                 setDisplay("wiki-page-text", false);
                 setDisplay("channel-path", true);
@@ -496,6 +518,7 @@ class SearchsourcesElement extends Element
                 setDisplay("channel-text", false);
                 setDisplay("wiki-page-text", true);
                 setDisplay("aux-url-xpath", true);
+                elt('channel-aux').innerHTML = aux_inner;
                 setDisplay("channel-path", true);
                 setDisplay("item-text", false);
                 setDisplay("item-text-regex", false);
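
Reviewer's sketch of what the json_encode call in the script block contributes. A multi-line channel_path cannot be written as a raw JavaScript string literal, but its JSON encoding stays on a single line with the line breaks escaped. The helper name toJsLiteral and the sample value are hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): encodes a PHP string as a
 * literal that can be emitted directly into a <script> block.
 */
function toJsLiteral(string $value): string
{
    // json_encode adds the surrounding quotes and escapes newlines,
    // double quotes, and backslashes.
    return json_encode($value);
}

$channel_path = "//a[@class='next']/@href\n//video/@src";
echo "channel_string = " . toJsLiteral($channel_path) . ";\n";
```

The encoded value still needs HTML escaping when it is later concatenated into an attribute or text area body on the JavaScript side; this sketch covers only the PHP to JavaScript step.
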
