Asked  7 Months ago    Answers:  5   Viewed   30 times

Notice how Google News lists sources at the bottom of each article excerpt.

The Guardian - ABC News - Reuters - Bloomberg

I'm trying to imitate that.

For example, when I submit the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ I want to get back "The Washington Times".

How can I do this with PHP?

 Answers

94

My answer expands on @AI W's suggestion to use the title of the page. The code below does what he described.

<?php

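function get_title($url){
  $str = file_get_contents($url);
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // collapse whitespace so line breaks inside <title> are handled
    // case-insensitive, non-greedy match; the slash in </title> must be escaped
    if(preg_match("/<title>(.*?)<\/title>/i",$str,$title)){
      return $title[1];
    }
  }
  return null; // fetch failed or no <title> found
}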
//Example:
echo get_title("http://www.washingtontimes.com/");

?>

OUTPUT

Washington Times - Politics, Breaking News, US and World News

As you can see, this is not exactly what Google displays, which leads me to believe that they take the URL's hostname and match it against their own list of source names.

http://www.washingtontimes.com/ => The Washington Times
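
The hostname-to-name lookup described above could be sketched roughly like this. It is only a minimal illustration, not how Google actually does it: the `$SOURCES` table, its entries, and the `get_source_name()` helper are all assumptions made up for this example, and you would have to maintain the list yourself.

```php
<?php
// Hypothetical lookup table: hostname => display name (illustrative entries only).
$SOURCES = array(
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
);

// Hypothetical helper: maps a URL to a source name via its hostname.
function get_source_name($url, array $sources) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // the input has no hostname component
    }
    $host = strtolower($host);
    // Drop a leading "www." so both forms share one table entry
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    // Fall back to the bare hostname when the site is not in the table
    return isset($sources[$host]) ? $sources[$host] : $host;
}

echo get_source_name('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/', $SOURCES);
```

Returning the bare hostname for unknown sites is just one possible fallback; returning the page title instead would also work.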

Wednesday, March 31, 2021
 
ariel
answered 7 Months ago
42

This is the way it should be:

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://example.com/");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

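// default to empty strings so the echoes below still work when a tag is missing
$description = $keywords = '';
$metas = $doc->getElementsByTagName('meta');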

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
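
For the site name specifically, the same DOMDocument approach can be pointed at a different meta tag. The sketch below assumes the page publishes an Open Graph `og:site_name` tag, which not every site does; the `get_site_name()` helper and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the content of <meta property="og:site_name">, or null if absent.
function get_site_name($html) {
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect markup
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:site_name') {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Inline sample markup, invented for this example
$sample = '<html><head><title>Debt panel fails test vote</title>'
        . '<meta property="og:site_name" content="The Washington Times">'
        . '</head><body></body></html>';
echo get_site_name($sample);
```

When the tag is missing the function returns null, so a caller can fall back to the title or the hostname.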
Wednesday, March 31, 2021
 
Gilko
answered 7 Months ago
44

Here is how the referenced code example is supposed to work:

product-category.php

<?php
define( 'INCLUDE_DIR', dirname( __FILE__ ) . '/' );

$rules = array( 
    'redirect-category'  => "/product-category\.php\?cat_id=(?'category'\d+)"
);

$uri = $_SERVER['REQUEST_URI'];
$uri = urldecode( $uri );

foreach ( $rules as $action => $rule ) {
    if ( preg_match( '#'.$rule.'#', $uri, $params ) ) {
        include( INCLUDE_DIR . $action . '.php' );
        exit();
    }
}

redirect-category.php

<?php
$categories = array(
    2 => 'men-items'
);
header("Location: " . preg_replace( '#'.$rule.'#', "/product_category/{$categories[$params['category']]}/", $uri ));
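
To see what the named group and the replacement produce without sending a header, the logic can be tried in isolation. This is a rough sketch: `category_redirect_target()` is a hypothetical helper, and the check for an unknown category id is an addition that is not in the code above.

```php
<?php
// Hypothetical helper: returns the redirect target, or null when nothing applies.
function category_redirect_target($uri, array $categories) {
    // Same rule as above; (?'category'\d+) is a named capture group
    $rule = "/product-category\.php\?cat_id=(?'category'\d+)";
    if (!preg_match('#' . $rule . '#', $uri, $params)) {
        return null; // the URI does not match the rule
    }
    $id = (int) $params['category'];
    if (!isset($categories[$id])) {
        return null; // unknown category id, so there is nothing to redirect to
    }
    return preg_replace('#' . $rule . '#', "/product_category/{$categories[$id]}/", $uri);
}

echo category_redirect_target('/product-category.php?cat_id=2', array(2 => 'men-items'));
```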
Saturday, May 29, 2021
 
the_e
answered 5 Months ago
55

You can use this regex:

#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is

Code:

$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
foreach($arr as $url) {
   $link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
           // anonymous function instead of create_function(), which was removed in PHP 8
           function ($m) {
               // prepend http:// only when the match has no scheme of its own
               $href = preg_match('#^https?://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
               return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
           },
           $url);
   echo $link . "\n";
}

OUTPUT:

<a href="http://www.domain.com/?foo=bar">http://www.domain.com/?foo=bar</a>
http://www.that"sallfolks.com
This is really cool site: <a href="https://www.domain.net">https://www.domain.net</a>/ isn't it?
<a href="http://subdomain.domain.org">http://subdomain.domain.org</a>
<a href="http://www.domain.com/folder">www.domain.com/folder</a>
Hello! You can visit <a href="http://vertigofx.com/mysite/rocks">vertigofx.com/mysite/rocks</a> for some awesome pictures, or just go to <a href="http://vertigofx.com">vertigofx.com</a> by itself
<a href="http://subdomain.domain.net">subdomain.domain.net</a>
<a href="http://subdomain.domain.edu/folder/subfolder">subdomain.domain.edu/folder/subfolder</a>
Hello! Check out my site at <a href="http://domain.net">domain.net</a>!
welcome.to.computers
Hello.Come visit <a href="http://oursite.com">oursite.com</a>!
foo.bar
<a href="http://domain.com/folder">domain.com/folder</a>

PS: This regex only supports the http and https schemes. If you also want to support another scheme such as ftp, you need to modify the regex a little.
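
One possible way to make that modification is to widen the optional scheme group. This is only a sketch under that assumption: `linkify()` is a hypothetical wrapper name, and the rest of the pattern is unchanged from the regex above.

```php
<?php
// Hypothetical wrapper: same pattern as above, but the scheme group also accepts ftp://
function linkify($text) {
    $pattern = '#(\s|^)((?:(?:https?|ftp)://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is';
    return preg_replace_callback($pattern, function ($m) {
        // Keep an existing scheme, otherwise default to http://
        $href = preg_match('#^(?:https?|ftp)://#i', $m[2]) ? $m[2] : 'http://' . $m[2];
        return $m[1] . '<a href="' . $href . '">' . $m[2] . '</a>';
    }, $text);
}

echo linkify('ftp://files.domain.org/pub'), "\n";
echo linkify('domain.com/folder'), "\n";
```

The scheme check inside the callback has to be widened as well, otherwise an ftp link would get http:// prepended to it.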

Friday, July 30, 2021
 
konstantin
answered 3 Months ago
100

For the regex, use:

document.title = document.title.replace (/[^0-9:]/g, "");

To detect title changes, use a MutationObserver, a DOM feature that is implemented in both Google Chrome and Firefox (the two main userscript browsers).

This complete script will work:

// ==UserScript==
// @name        Shakes & Fidget Buffed title shortener
// @namespace   http://släcker.de
// @version     0.1
// @description  Removes the page title of Shakes & Fidget to only display left time if it exists
// @include     *.sfgame.*
// @exclude     www.sfgame.*
// @exclude     sfgame.*
// @copyright   2013+, slaecker, Stack Overflow
// @grant       GM_addStyle
// ==/UserScript==
/*- The @grant directive is needed to work around a design change
    introduced in GM 1.0.   It restores the sandbox.
*/

var MutationObserver = window.MutationObserver || window.WebKitMutationObserver;
var myObserver       = new MutationObserver (titleChangeDetector);
var obsConfig        = {
    //-- Subtree needed.
    childList: true, characterData: true, subtree: true
};

myObserver.observe (document, obsConfig);

function titleChangeHandler () {
    this.weInitiatedChange      = this.weInitiatedChange || false;
    if (this.weInitiatedChange) {
        this.weInitiatedChange  = false;
        //-- No further action needed
    }
    else {
        this.weInitiatedChange  = true;
        document.title = document.title.replace (/[^0-9:]/g, "");
    }
}

function titleChangeDetector (mutationRecords) {

    mutationRecords.forEach ( function (mutation) {
        //-- Sensible, Firefox
        if (    mutation.type                       == "childList"
            &&  mutation.target.nodeName            == "TITLE"
        ) {
            titleChangeHandler ();
        }
        //-- WTF, Chrome
        else if (mutation.type                      == "characterData"
            &&  mutation.target.parentNode.nodeName == "TITLE"
        ) {
            titleChangeHandler ();
        }
    } );
}

//-- Probably best to wait for first title change, but uncomment the next line if desired.
//titleChangeHandler ();

If you are using some other browser (state that in the question), then fall back to using setInterval().

Monday, August 30, 2021
 
DNB5brims
answered 2 Months ago