Getting the Subdomain from a URL in .NET

by Nimesh 1. November 2010 23:13

Recently, I was writing some logic that required splitting a URL's host name into its individual parts.

For instance, the request host a.b.c.domain.com.au has to be separated into:

  • subdomain = a.b.c
  • domain = domain
  • com = com
  • countryextension = au

Note: This example was written to suit my scenario, and it does not cover every possible host name. For example, a top-level domain longer than three characters, such as .info, will not match. Feel free to modify the regex to suit your needs.

Obviously, the first thing that comes to mind is writing a regular expression, but why write one if someone has already written it?

So I started googling, but could not find a regular expression for the .NET engine that splits a host name into subdomain and domain the way I expected. This led me to write my own.

I came up with the following regular expression, using the RegexLib Tester as my initial tool:

Regex 1:  ^([\w\.\-]+\.)*([\w\-]+\.)([\w]{2,3})(\.[\w]{2})?$

Running Regex 1 against a.b.c.domain.com.au on RegexLib gave me the following groups:

  • $1 = a.b.c.domain.
  • $2 = com.
  • $3 = au
  • $4 = (empty)

If you look at the groups, this is still not right. The grouping I expected is:

  • $1 = a.b.c.
  • $2 = domain.
  • $3 = com
  • $4 = .au

Note: Ignore the '.' placement in the result above; the dots can be trimmed with a string function such as Trim('.').
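To reproduce the left-to-right result outside an online tester, here is a minimal sketch that runs Regex 1 with the default options and prints the numbered groups. It is written as a top-level program (C# 9 or later); on older compilers, wrap it in a class with a Main method. The variable names are my own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Regex 1 from the post, matched with the default left-to-right scan.
string pattern = @"^([\w\.\-]+\.)*([\w\-]+\.)([\w]{2,3})(\.[\w]{2})?$";
string host = "a.b.c.domain.com.au"; // sample host from the post

Match leftToRight = Regex.Match(host, pattern);

// Groups[0] is the whole match, so the numbered groups start at index 1.
for (int i = 1; i < leftToRight.Groups.Count; i++)
{
    Console.WriteLine("${0} = '{1}'", i, leftToRight.Groups[i].Value);
}
// $1 = 'a.b.c.domain.'
// $2 = 'com.'
// $3 = 'au'
// $4 = ''
```

The greedy first group swallows as much as it can from the left, which leaves only "com." and "au" for the groups that were meant to hold the domain and the generic extension.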

After failing to come up with a regex that served my purpose, I started looking at the problem from a different angle. After playing with the regex for a couple of hours, I realised that if the .NET engine evaluated my regex from right to left, the optional country extension and the 2-3 character group would be matched first, and I would get the expected result. That led me to look for regex options that tell the engine to match from right to left.

I found the RegexOptions.RightToLeft enum value, which does exactly that.

I used Rad Software Regular Expression Designer as my testing tool. As the image below shows, checking the RightToLeft option gave me the desired output.

[Image: rad_capture, the Regular Expression Designer with the RightToLeft option checked]
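The same check can be done in code. The sketch below, again a top-level C# program with variable names of my own choosing, runs Regex 1 with RegexOptions.RightToLeft and prints the numbered groups.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^([\w\.\-]+\.)*([\w\-]+\.)([\w]{2,3})(\.[\w]{2})?$";
string host = "a.b.c.domain.com.au"; // sample host from the post

// RightToLeft makes the engine start at the end of the input, so the optional
// two-character group and the 2-3 character group are filled before the
// greedy subdomain group gets a chance to consume them.
Match rightToLeft = Regex.Match(host, pattern, RegexOptions.RightToLeft);

for (int i = 1; i < rightToLeft.Groups.Count; i++)
{
    Console.WriteLine("${0} = '{1}'", i, rightToLeft.Groups[i].Value);
}
// $1 = 'a.b.c.'
// $2 = 'domain.'
// $3 = 'com'
// $4 = '.au'
```

Only the option changes; the pattern itself is identical to the one that gave the wrong grouping with the default scan direction.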

After changing the regex to use named groups, it looked like this:

//Final Regex:  
^(?<subdomain>[\w\.\-]+\.)*(?<domain>[\w\-]+\.)(?<com>[\w]{2,3})(?<countryextension>\.[\w]{2})?$

Groups (matched right to left, after trimming the dots):

  • subdomain = a.b.c
  • domain = domain
  • com = com
  • countryextension = au
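As a rough sketch of how the named groups could be turned into the dot-free values listed above, the following top-level C# program matches right to left and trims the separator dots. SplitHost and groupNames are hypothetical names introduced for this illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string pattern =
    @"^(?<subdomain>[\w\.\-]+\.)*(?<domain>[\w\-]+\.)(?<com>[\w]{2,3})(?<countryextension>\.[\w]{2})?$";
string[] groupNames = { "subdomain", "domain", "com", "countryextension" };

// Hypothetical helper: returns the four parts keyed by group name,
// or an empty dictionary when the host does not fit the pattern.
Dictionary<string, string> SplitHost(string host)
{
    var parts = new Dictionary<string, string>();
    Match m = Regex.Match(host, pattern, RegexOptions.RightToLeft);
    if (m.Success)
    {
        foreach (string name in groupNames)
        {
            // The raw groups still carry their separator dots
            // ("a.b.c.", "domain.", ".au"), so Trim('.') strips them.
            parts[name] = m.Groups[name].Value.Trim('.');
        }
    }
    return parts;
}

Dictionary<string, string> result = SplitHost("a.b.c.domain.com.au");
foreach (string name in groupNames)
{
    Console.WriteLine("{0} = {1}", name, result[name]);
}
// subdomain = a.b.c
// domain = domain
// com = com
// countryextension = au
```

A group that did not take part in the match, such as countryextension for www.domain.com, comes back as an empty string.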

This can then be used inside a .NET function as shown below. My requirement was to extract just the domain.com.au part of the host. Because the regex captures every part separately, the same approach lets you build any other combination of parts.

    using System.Text.RegularExpressions;

    /// <summary>
    /// Input: test.domain.com.au    Output: domain.com.au
    /// Input: xyz.domain.com        Output: domain.com
    /// Input: xyz.domain.co.uk      Output: domain.co.uk
    /// Input: www.domain.com        Output: domain.com
    /// In short, strips the subdomain part and returns the rest of the host.
    /// Returns an empty string if the host does not match the pattern.
    /// </summary>
    /// <param name="refUrl">Host name, as returned by Request.Url.Host</param>
    /// <returns>The host without its subdomain part.</returns>
    private static string GetReferrerSiteFromUrl(string refUrl)
    {
        string result = string.Empty;
        string pattern = @"^(?<subdomain>[\w\.\-]+\.)*(?<domain>[\w\-]+\.)(?<com>[\w]{2,3})(?<countryextension>\.[\w]{2})?$";

        // RightToLeft makes the engine match from the end of the host, so the
        // country extension and the 2-3 character group are captured first.
        Match match1 = Regex.Match(refUrl, pattern, RegexOptions.RightToLeft);
        if (match1.Success)
        {
            // "domain" keeps its trailing dot and "countryextension" its leading
            // dot, so plain concatenation rebuilds e.g. domain.com.au.
            result = match1.Groups["domain"].Value + match1.Groups["com"].Value + match1.Groups["countryextension"].Value;
        }

        return result;
    }
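Since the function expects a bare host name, a full URL has to be reduced to its host first. The sketch below assumes System.Uri is used for that step and repeats a condensed copy of the function so that it runs on its own as a top-level C# program. The sample URLs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Condensed copy of GetReferrerSiteFromUrl, repeated so the sketch is self-contained.
string GetReferrerSiteFromUrl(string refUrl)
{
    const string pattern =
        @"^(?<subdomain>[\w\.\-]+\.)*(?<domain>[\w\-]+\.)(?<com>[\w]{2,3})(?<countryextension>\.[\w]{2})?$";
    Match match = Regex.Match(refUrl, pattern, RegexOptions.RightToLeft);
    return match.Success
        ? match.Groups["domain"].Value + match.Groups["com"].Value + match.Groups["countryextension"].Value
        : string.Empty;
}

// Hypothetical sample URLs; Uri.Host drops the scheme, port, path and query.
string[] urls =
{
    "http://test.domain.com.au/page?id=1",
    "https://xyz.domain.co.uk:8080/",
    "http://www.example.info/"
};

foreach (string url in urls)
{
    string host = new Uri(url).Host;
    Console.WriteLine("{0} -> '{1}'", host, GetReferrerSiteFromUrl(host));
}
// test.domain.com.au -> 'domain.com.au'
// xyz.domain.co.uk -> 'domain.co.uk'
// www.example.info -> ''
```

The last line shows the pattern's limit: a four-character ending such as .info fits neither the 2-3 character group nor the two-character country extension, so the function returns an empty string.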
