Asked  6 Months ago    Answers:  5   Viewed   19 times

I would like to match just the root of a URL and not the whole URL from a text string. Given:

http://www.youtube.com/watch?v=ClkQA2Lb_iE
http://youtu.be/ClkQA2Lb_iE
http://www.example.com/12xy45
http://example.com/random

I want to match only the last two, which resolve to the www.example.com or example.com domain.

I've heard regex is slow, and this would be the second regular expression on the page, so if there is any way to do it without regex, let me know.

I'm seeking a JS/jQuery version of this solution.

 Answers

99

I recommend using the npm package psl (Public Suffix List). The Public Suffix List is a list of all valid domain suffixes and rules. It covers not just country code top-level domains (ccTLDs) but also Unicode suffixes that would be considered part of the root domain (e.g. www.??.??.cn, b.c.kobe.jp). Read more about it here.

Try:

npm install --save psl

Then with my "extractHostname" implementation run:

let psl = require('psl');
let url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
psl.get(extractHostname(url)); // returns youtube.com

I can't use an npm package in this snippet, so the code below only tests extractHostname.

function extractHostname(url) {
    var hostname;
    //find & remove protocol (http, ftp, etc.) and get hostname

    if (url.indexOf("//") > -1) {
        hostname = url.split('/')[2];
    }
    else {
        hostname = url.split('/')[0];
    }

    //find & remove port number
    hostname = hostname.split(':')[0];
    //find & remove "?"
    hostname = hostname.split('?')[0];

    return hostname;
}

//test the code
console.log("== Testing extractHostname: ==");
console.log(extractHostname("http://www.blog.classroom.me.uk/index.php"));
console.log(extractHostname("http://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("https://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("ftps://ftp.websitename.com/dir/file.txt"));
console.log(extractHostname("websitename.com:1234/dir/file.txt"));
console.log(extractHostname("ftps://websitename.com:1234/dir/file.txt"));
console.log(extractHostname("example.com?param=value"));
console.log(extractHostname("https://facebook.github.io/jest/"));
console.log(extractHostname("//youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("http://localhost:4200/watch?v=ClkQA2Lb_iE"));

// Warning: you can use this function to extract the "root" domain, but it will not be as accurate as using the psl package.

function extractRootDomain(url) {
    var domain = extractHostname(url),
        splitArr = domain.split('.'),
        arrLen = splitArr.length;

    //extracting the root domain here
    //if there is a subdomain 
    if (arrLen > 2) {
        domain = splitArr[arrLen - 2] + '.' + splitArr[arrLen - 1];
        //check to see if it's using a Country Code Top Level Domain (ccTLD) (i.e. ".me.uk")
        if (splitArr[arrLen - 2].length == 2 && splitArr[arrLen - 1].length == 2) {
            //this is using a ccTLD
            domain = splitArr[arrLen - 3] + '.' + domain;
        }
    }
    return domain;
}

//test extractRootDomain
console.log("== Testing extractRootDomain: ==");
console.log(extractRootDomain("http://www.blog.classroom.me.uk/index.php"));
console.log(extractRootDomain("http://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractRootDomain("https://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractRootDomain("www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractRootDomain("ftps://ftp.websitename.com/dir/file.txt"));
console.log(extractRootDomain("websitename.co.uk:1234/dir/file.txt"));
console.log(extractRootDomain("ftps://websitename.com:1234/dir/file.txt"));
console.log(extractRootDomain("example.com?param=value"));
console.log(extractRootDomain("https://facebook.github.io/jest/"));
console.log(extractRootDomain("//youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractRootDomain("http://localhost:4200/watch?v=ClkQA2Lb_iE"));
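
The two-letter check in extractRootDomain is a heuristic, so a suffix such as com.au slips past it (for "www.example.com.au" it returns "com.au"). Below is a minimal sketch of a suffix-list lookup instead. SAMPLE_SUFFIXES and rootDomainFromSuffixes are hypothetical names introduced here, and the list is a tiny hand-picked subset, not the real Public Suffix List; wildcard and exception rules are ignored.

```javascript
// Illustrative subset only (assumption); the real list has thousands of entries.
var SAMPLE_SUFFIXES = ["com", "io", "uk", "co.uk", "me.uk", "au", "com.au"];

// Hypothetical helper: returns the matched suffix plus one label in front of it.
function rootDomainFromSuffixes(hostname, suffixes) {
    var labels = hostname.toLowerCase().split(".");
    // Walk from the longest candidate suffix to the shortest.
    for (var i = 1; i < labels.length; i++) {
        var candidate = labels.slice(i).join(".");
        if (suffixes.indexOf(candidate) > -1) {
            return labels.slice(i - 1).join(".");
        }
    }
    // No known suffix (e.g. "localhost"): return the input unchanged.
    return hostname;
}

console.log(rootDomainFromSuffixes("www.example.com.au", SAMPLE_SUFFIXES));
console.log(rootDomainFromSuffixes("www.blog.classroom.me.uk", SAMPLE_SUFFIXES));
console.log(rootDomainFromSuffixes("localhost", SAMPLE_SUFFIXES));
```

Trying the longest candidate first is what lets "me.uk" win over "uk". In practice the psl package does this against the full list.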

Regardless of whether the URL has a protocol or even a port number, you can extract the domain. This is a very simplified, non-regex solution, so I think it will do.
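
Another non-regex route, where it is available, is the built-in URL constructor (modern browsers and Node.js). The following is only a sketch: hostnameViaUrl is a hypothetical name, and prefixing "http://" onto inputs that lack a scheme is an assumption made here so that bare hosts such as "websitename.com:1234/dir" can be parsed.

```javascript
// Hypothetical helper; not part of the answer above.
function hostnameViaUrl(input) {
    var candidate = input;
    if (input.indexOf("//") === 0) {
        // Protocol-relative URL: borrow a scheme so it becomes absolute.
        candidate = "http:" + input;
    } else if (input.indexOf("://") === -1) {
        // Bare host: assume http (assumption for parsing only).
        candidate = "http://" + input;
    }
    try {
        // hostname excludes the port, path, query and fragment.
        return new URL(candidate).hostname;
    } catch (e) {
        // The URL constructor throws a TypeError on input it cannot parse.
        return null;
    }
}

console.log(hostnameViaUrl("http://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(hostnameViaUrl("websitename.com:1234/dir/file.txt"));
console.log(hostnameViaUrl("//youtube.com/watch?v=ClkQA2Lb_iE"));
```

The result can be passed to psl.get or extractRootDomain in the same way as the output of extractHostname.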

*Thank you @Timmerz, @renoirb, @rineez, @BigDong, @ra00l, @ILikeBeansTacos, @CharlesRobertson for your suggestions! @ross-allen, thank you for reporting the bug!

Tuesday, June 1, 2021
 
toesslab
answered 6 Months ago
36

Since you're trying to use stringr, I recommend str_extract (I'd recommend it even if you weren't trying to use stringr):

x <- c('RED LOBSTER CA04606', 'Red Lobster NewYork WY245')
str_extract(x, '[a-zA-Z ]+\\b')
# [1] "RED LOBSTER "          "Red Lobster NewYork "

The '\\b' (word boundary) in the regex prevents the 'CA' from 'CA04606' being extracted. Note that the backslash must be doubled inside an R string.
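
As a cross-check in JavaScript, the language of the original question, the word-boundary behaviour can be reproduced with String.prototype.match; inside a regex literal a single backslash is enough. This is only an illustration and the variable names are made up for it.

```javascript
var samples = ["RED LOBSTER CA04606", "Red Lobster NewYork WY245"];

// With \b the match must end at a word boundary, so the engine
// backtracks to just before "CA" / "WY" (the trailing space is kept).
var withBoundary = samples.map(function (s) {
    return s.match(/[a-zA-Z ]+\b/)[0];
});

// Without \b the letters of the post code are swallowed as well.
var withoutBoundary = samples.map(function (s) {
    return s.match(/[a-zA-Z ]+/)[0];
});

console.log(withBoundary);    // [ 'RED LOBSTER ', 'Red Lobster NewYork ' ]
console.log(withoutBoundary); // [ 'RED LOBSTER CA', 'Red Lobster NewYork WY' ]
```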

If you don't like that trailing space you could use str_trim to remove it, or you could modify the regex:

str_extract(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b')
# [1] "RED LOBSTER"          "Red Lobster NewYork"

Note - if your string has non-numbers after the post code, the above only returns the words before. So in the example below, if you wanted to get the 'NewYork' after the 'WY245', you can use str_extract_all and paste the results together:

x <- c(x, 'Red Lobster WY245 NewYork')
str_extract_all(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b')
# [[1]]
# [1] "RED LOBSTER"
# 
# [[2]]
# [1] "Red Lobster NewYork"
# 
# [[3]]
# [1] "Red Lobster" "NewYork"    

# Paste the bits together with paste(..., collapse=' ')
sapply(str_extract_all(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b'), paste, collapse=' ')
# [1] "RED LOBSTER"          "Red Lobster NewYork" "Red Lobster NewYork"
Thursday, August 5, 2021
 
tika
answered 4 Months ago
88

Use an old text-parsing trick: widen the gap between words by replacing each space with a long run of spaces (via the SUBSTITUTE and REPT functions), which lets MID take a generous window around the intended substring without touching its neighbours.

      Parse Substring with SUBSTITUTE and REPT

The formula in B2 is,

=TRIM(MID(SUBSTITUTE(A2, " ", REPT(" ", 99)), MAX(1, FIND("=", SUBSTITUTE(A2, " ", REPT(" ", 99)))-50), 99))

The TRIM function (used as a wrapper) removes leading and trailing spaces.
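
To make the steps of the formula easier to follow, here is a rough JavaScript equivalent. It is a sketch, not a translation Excel guarantees: wordContaining is a hypothetical name, returning null stands in for the #VALUE! error FIND gives when the marker is absent, and it assumes the target word is shorter than about 50 characters, as the formula does.

```javascript
// Hypothetical helper mirroring the formula step by step.
function wordContaining(text, marker) {
    var WIDTH = 99;
    // SUBSTITUTE(A2, " ", REPT(" ", 99)): blow every space up to 99 spaces.
    var padded = text.split(" ").join(" ".repeat(WIDTH));
    // FIND is 1-based; indexOf is 0-based and returns -1 when not found.
    var pos = padded.indexOf(marker) + 1;
    if (pos === 0) {
        return null;
    }
    // MAX(1, pos - 50): start the window 50 characters before the marker.
    var start = Math.max(1, pos - 50);
    // MID(padded, start, 99) followed by TRIM.
    return padded.slice(start - 1, start - 1 + WIDTH).trim();
}

console.log(wordContaining("foo key=value bar", "=")); // key=value
```

Because neighbouring words are pushed 99 characters away, the 99-character window can only ever contain the word that holds the marker plus padding, which TRIM then removes.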

Friday, August 20, 2021
 
Skipper
answered 4 Months ago
32

Like so:

new RegExp(q + " Gonzalez", "i");

Using the / characters is how to define a RegExp with RegExp literal syntax. To create a RegExp from a string, pass the string to the RegExp constructor. These are equivalent:

var expr = /Josh Gonzalez/i;
var expr = new RegExp("Josh Gonzalez", "i");

The way you have it you are passing a regular expression to the regular expression constructor... it's redundant.
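
One caveat when building a pattern from a variable: any regex metacharacters in q are interpreted as pattern syntax. A minimal sketch of escaping them first is shown below; escapeRegExp and buildNameRegex are hypothetical helper names, and it assumes q is meant to be matched literally.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical wrapper around the constructor call from the answer.
function buildNameRegex(q) {
    return new RegExp(escapeRegExp(q) + " Gonzalez", "i");
}

console.log(buildNameRegex("Josh").test("josh gonzalez"));            // true
console.log(new RegExp("J.sh Gonzalez", "i").test("Josh Gonzalez"));  // true: "." matches any character
console.log(buildNameRegex("J.sh").test("Josh Gonzalez"));            // false: "." is now literal
```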

Monday, November 15, 2021
 
Gil
answered 2 Weeks ago
62

You are mixing both the string regex notation and the literal regex notation. Choose only one of:

new RegExp('ab+c');
new RegExp(/ab+c/);

In your case, maybe something like this instead:

var mobileNumber = new RegExp(/[0-9-()+]{3,20}/);
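
The question's own code is not shown here, so the following is a guess at the kind of mixing meant: putting the literal's slashes inside a string makes them part of the pattern. The variable names are illustrative.

```javascript
var fromString = new RegExp("ab+c");   // string notation
var fromLiteral = /ab+c/;              // literal notation
var mixed = new RegExp("/ab+c/");      // slashes are now matched literally

console.log(fromString.test("abbc"));  // true
console.log(fromLiteral.test("abbc")); // true
console.log(mixed.test("abbc"));       // false
console.log(mixed.test("/abbc/"));     // true
```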
Wednesday, November 24, 2021
 
alioygur
answered 5 Days ago