Bad Word Filter With Regular Expressions

I have seen many versions of these, and a lot of the time people expect a bad word to be written out in full, e.g. BADWORD.  They sometimes overlook the fact that users work out this rule and simply bypass it by adding symbols in between the letters, e.g. B*A*D*W*O*R*D.  Of course, this will not be recognized by simply searching the string for BADWORD.
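
To make the problem concrete, here is a minimal sketch of the naive check. The class and method names are made up purely for illustration and are not part of the filter shown later.

```csharp
using System;

public static class NaiveCheckSketch
{
    // Hypothetical helper: the plain, case-insensitive substring test
    // that many simple filters rely on.
    public static bool ContainsBadWord(string input, string badWord)
    {
        return input.IndexOf(badWord, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling ContainsBadWord("this is a BADWORD", "badword") finds the word, but the same call with "this is a B*A*D*W*O*R*D" does not, because the letters are no longer adjacent.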

The technique I have used here relies on a base list in XML.  I have created a class called BadWordFilter, which uses the singleton pattern.  I do this because the class first has to build a list of Regex objects from the words inside the base XML file, and as I do not want these to be rebuilt at every bad word check, I have opted for the singleton pattern.
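
The idea of building the patterns only once can be sketched as below. This is not the class from this post: it assumes .NET 4 or later for Lazy<T>, the class name and the BuildCount counter are hypothetical, and a single hard-coded pattern stands in for the XML list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class CompiledOnceSketch
{
    // Counts constructions so the "built only once" behaviour can be observed.
    public static int BuildCount = 0;

    // Lazy<T> runs the factory on first access and is thread-safe by default.
    private static readonly Lazy<CompiledOnceSketch> lazy =
        new Lazy<CompiledOnceSketch>(() => new CompiledOnceSketch());

    public static CompiledOnceSketch Instance
    {
        get { return lazy.Value; }
    }

    public List<Regex> Patterns { get; private set; }

    private CompiledOnceSketch()
    {
        BuildCount++;
        Patterns = new List<Regex>();
        // Stand-in for the expensive step of building every pattern.
        Patterns.Add(new Regex(@"[bB]\W*[aA]\W*[dD]\W*"));
    }
}
```

However many times Instance is read, the constructor runs a single time, so the cost of building the patterns is paid once per application rather than once per check.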

For any word in the list, the rendered pattern follows a set trend.  So if we look again at BADWORD, the regular expression I have come up with is as follows.

([bB]\W*[aA]\W*[dD]\W*[wW]\W*[oO]\W*[rR]\W*[dD]\W*)

 

I create the pattern at runtime.  For each letter I allow either the lower or upper case form, followed by any number of non-word characters (\W).  Ultimately the pattern matches anything which spells our bad word once everything that is not a word character is ignored.
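
As a rough sketch of that pattern building for a single word (BuildPattern and IsBad are hypothetical helpers, separate from the class further down), the following may help. I write each character class as [bB]: inside square brackets a | is matched as a literal pipe rather than meaning "or", so it is left out here.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PatternSketch
{
    // Hypothetical helper: builds the pattern for one word.
    public static string BuildPattern(string word)
    {
        StringBuilder sb = new StringBuilder("(");
        foreach (char c in word)
        {
            // One character class per letter (both cases),
            // followed by any run of non-word characters.
            sb.AppendFormat("[{0}{1}]\\W*", char.ToLowerInvariant(c), char.ToUpperInvariant(c));
        }
        sb.Append(")");
        return sb.ToString();
    }

    // Hypothetical helper: true if the input spells the word, separators ignored.
    public static bool IsBad(string input, string word)
    {
        return Regex.IsMatch(input, BuildPattern(word));
    }
}
```

With this sketch, IsBad("B*A*D*W*O*R*D", "badword") is true.  Note that \W does not cover digits or the underscore, so an input such as "b_adword" is not caught by this approach.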

 

I have created a simple test page here to have a go.  Please note I have only put the really serious words in the list for the purposes of this demonstration.  I have not published this list as I do not think it is necessary.  I have used a simple XML structure, so please feel free to copy the code here and add as many bad words as you like.

 

Example Page : http://andrewrea.co.uk/badwordfilter/Default.aspx

 

The BadWordFilter class

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using System.Xml;

/// <summary>
/// Singleton filter which detects bad words in text, even when symbols are placed between the letters
/// </summary>
public class BadWordFilter
{

    /// <summary>
    /// These are the options which I use in order to determine the way I handle any bad text
    /// </summary>
    public enum CleanUpOptions
    {
        ReplaceEachWord,
        BlankBadText,
        ReplaceWholeText
    }

    /// <summary>
    /// Private constructor which instantiates the list of regex
    /// </summary>
    private BadWordFilter()
    {
        patterns = new List<Regex>();
    }

    /// <summary>
    /// The patterns
    /// </summary>
    private List<Regex> patterns;

    
    public List<Regex> Patterns
    {
        get { return patterns; }
        set { patterns = value; }
    }

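    private static BadWordFilter m_instance = null;
    private static readonly object m_lock = new object();

    public static BadWordFilter Instance
    {
        get
        {
            //Lock so that concurrent requests do not each build the patterns
            lock (m_lock)
            {
                if (m_instance == null)
                    m_instance = CreateBadWordFilter(HttpContext.Current.Server.MapPath("listofwords.xml"));
            }

            return m_instance;
        }
    }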

    /// <summary>
    /// Create all the patterns required and add them to the list
    /// </summary>
    /// <param name="badWordFile">Full path to the XML file containing the bad words</param>
    /// <returns>A filter holding one pattern per word in the file</returns>
    protected static BadWordFilter CreateBadWordFilter(string badWordFile)
    {
        BadWordFilter filter = new BadWordFilter();
        XmlDocument badWordDoc = new XmlDocument();
        badWordDoc.Load(badWordFile);

        //Fetch the word nodes once rather than re-querying the document on every iteration
        XmlNodeList wordNodes = badWordDoc.GetElementsByTagName("word");

        //Loop through the xml document for each bad word in the list
        for (int i = 0; i < wordNodes.Count; i++)
        {
            //Split each word into a character array
            char[] characters = wordNodes[i].InnerText.ToCharArray();

            //We need a fast way of appending to an existing string
            StringBuilder patternBuilder = new StringBuilder();

            //The start of the pattern
            patternBuilder.Append("(");

            //We next go through each letter and append the part of the pattern.
            //It is this stage which generates the upper and lower case variations.
            //There is no | inside the brackets: in a character class it would match a literal pipe.
            for (int j = 0; j < characters.Length; j++)
            {
                patternBuilder.AppendFormat("[{0}{1}]\\W*", characters[j].ToString().ToLower(), characters[j].ToString().ToUpper());
            }

            //End the pattern
            patternBuilder.Append(")");

            //Add the new pattern to our list.
            filter.Patterns.Add(new Regex(patternBuilder.ToString()));
        }
        return filter;
    }

    /// <summary>
    /// The function which returns the manipulated string
    /// </summary>
    /// <param name="input">The text to check</param>
    /// <param name="options">How any bad text which is found should be handled</param>
    /// <returns>The cleaned text</returns>
    public string GetCleanString(string input, CleanUpOptions options)
    {
        if (options == CleanUpOptions.BlankBadText)
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we want to return an empty string if we find any bad word
                if (patterns[i].Match(input).Success)
                    return String.Empty;
            }
        }
        else if (options == CleanUpOptions.ReplaceWholeText)
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we want to return a specified statement if we find any bad word
                if (patterns[i].Match(input).Success)
                    return "The text contains unsuitable content";
            }
        }
        else
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we actually replace each instance of any bad word with a specified string.
                input = patterns[i].Replace(input, "**Unsuitable Word**");
            }
        }

        //return the manipulated string
        return input;
    }
}
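
To show what the three clean-up options return, here is a small self-contained sketch. It does not use the class above: CleanUpSketch is a hypothetical stand-in with one hard-coded pattern for the word "bad", so it can run without the XML file or a web context.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CleanUpSketch
{
    // One hard-coded pattern standing in for the list built from XML.
    private static readonly Regex Pattern = new Regex(@"([bB]\W*[aA]\W*[dD]\W*)");

    // Mirrors ReplaceEachWord: every match is swapped for a marker.
    public static string ReplaceEachWord(string input)
    {
        return Pattern.Replace(input, "**Unsuitable Word**");
    }

    // Mirrors BlankBadText: any match blanks the whole text.
    public static string BlankBadText(string input)
    {
        return Pattern.IsMatch(input) ? String.Empty : input;
    }

    // Mirrors ReplaceWholeText: any match replaces the whole text with a notice.
    public static string ReplaceWholeText(string input)
    {
        return Pattern.IsMatch(input) ? "The text contains unsuitable content" : input;
    }
}
```

For the input "that is b.a.d news", ReplaceEachWord returns "that is **Unsuitable Word**news".  The space after the word disappears because the pattern ends with \W*, so the match takes any trailing punctuation or whitespace with it.  Clean text such as "all fine here" comes back unchanged from all three methods.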

 

The XML file which I have used is below.  Dead simple, but does the job.

<?xml version="1.0" encoding="utf-8" ?>
<words>
  <word>bad word</word>
  <word>ugly word</word>
  <word>bla bla bla</word>
</words>
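
Reading the words back out of this structure can be sketched as follows. WordListSketch and ReadWords are hypothetical names, and the XML is passed in as a string here purely so the example runs without a file on disk.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class WordListSketch
{
    // Hypothetical helper: returns the text of every <word> element.
    public static List<string> ReadWords(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> words = new List<string>();
        foreach (XmlNode node in doc.GetElementsByTagName("word"))
        {
            words.Add(node.InnerText);
        }
        return words;
    }
}
```

One thing to bear in mind when filling the list: a space inside an entry such as "bad word" becomes part of the pattern like any other character, so that entry expects a space between the two words and will not match "badword" written as one.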

 

Cheers,

 

Andrew :-)

Published Saturday, May 03, 2008 9:14 AM by REA_ANDREW
