October 8, 2012

How to efficiently replace multiple characters in a Java String

The problem of string transformation and manipulation is so common that every programming language offers either special features about it or, at the very least, a standard library with basic solutions.

In Java there's both regular expression matching /replacement via its standard library, and various “replace” operations that may be directly called on a String object. The replace operations return a new String, as in Java there can be no in-place replacement of String contents; String objects are immutable.

One problem that is common enough, in my opinion, to justify the development of a special method to handle it is this: replacing multiple characters in a String at once (in one go).

There are three kinds of replacements that we want to handle:
  1. Replacing a character with another single character. This kind of replacement does not affect the length of the original string.
  2. Replacing a single character with a sequence of characters (a string). This kind of replacement will increase the length of the original string.
  3. Replacing a character with “nothing” i.e. removing the character from the string. This kind of operation will reduce the length of the original string.
We will use a Map<Character,Object> to associate the characters that are going to be replaced (keys in the Map) with the respective replacements. The replacement objects (values in the Map) will be Character objects for case number 1, non empty strings for case number two, and the empty string for case number 3.
 
/**
 * Helper class to facilitate replacing multiple characters in a String.
 * 
 * @author 0xReinventor
 */
public final class StringConvertor
{
   private HashMap<Character, Object> _map = new HashMap<Character, Object>();

   /**
    * Adds a character that will be replaced by a non-empty string.
    * 
    * @param oldChar
    * @param newString
    * @return
    */
   public StringConvertor put(char oldChar, String newString)
   {
      _map.put(oldChar, newString);
      return this;
   }

   /**
    * Adds characters that will be replaced by other single characters.
    * 
    * @param oldChars
    *           The characters in this string will be replaced by the respective
    *           character in the newChars argument (i.e. by the character at the
    *           same index).
    * @param newChars
    * @return
    */
   public StringConvertor putCharsToReplace(String oldChars, String newChars)
   {
      if (newChars.length() < oldChars.length())
      {
         throw new IllegalArgumentException("Not enough 'newChars'.");
      }
      for (int i = 0; i < oldChars.length(); i += 1)
      {
         _map.put(oldChars.charAt(i), newChars.charAt(i));
      }
      return this;
   }

   /**
    * Adds characters that will be removed from the string.
    * 
    * @param chars
    * @return
    */
   public StringConvertor putCharsToDelete(String chars)
   {
      for (int i = 0; i < chars.length(); i += 1)
      {
         _map.put(chars.charAt(i), "");
      }
      return this;
   }

   public String replaceAll(String str)
   {
      // TODO replace all characters in str that are mentioned in the _map as keys
      // with the associated values found in the _map.
      return null;
   }
}

As an example, if we wanted to to convert “<” and “>” to HTML entities, we could use this object like this:
StringConvertor conv = new StringConvertor().put('<', "&lt;").put('>', "&gt;");
System.out.println(conv.replaceAll(
      "a < b < c => a < c"));
And the output would be:
a &lt; b &lt; c =&gt; a &lt; c
A possible implementation of replaceAll() would be:
   /**
    * Replaces all occurrences of characters in str that match the characters in
    * the StringConvertor object with the associated replacements.
    * 
    * @param str
    * @return
    */
   public String replaceAll(String str)
   {
      Object mapped = null;
      int i = -1;
      while (++i < str.length()) // Skip characters that do not need to be
                                 // replaced.
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            break;
         }
      }
      if (mapped == null) // No need to replace any characters in the string.
      {
         return str;
      }
      StringBuilder dest = new StringBuilder();
      dest.append(str, 0, i).append(mapped);
      int start = -1;
      while (++i < str.length())
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            if (start >= 0) // First append any characters that do not need
                            // conversion.
            {
               dest.append(str, start, i);
               start = -1;
            }
            if (mapped instanceof Character)
            {
               // Make sure to call append(char c) and not append(Object o).
               dest.append(((Character) mapped).charValue());
            }
            else if (((String) mapped).length() > 0) // Skip the call for empty
                                                     // strings.
            {
               dest.append((String) mapped);
            }
         }
         else
         {
            if (start < 0) // Mark the beginning of a segment that does not need
                           // conversion.
            {
               start = i;
            }
         }
      }
      if (start >= 0) // Don't forget to append any pending characters.
      {
         dest.append(str, start, str.length());
      }
      return dest.toString();
   }

The implementation takes care not to duplicate the original string, if there are no characters to be converted. Also, when there are characters to be converted, it will avoid copying characters one-by-one; i.e. the ones that do not need conversion will be copied in chunks as large as possible.

This is the fastest implementation that I could come up with. It may be optimised a bit further, if we only need to cover case 1: replacing characters with single characters (not with strings, and no removal of characters). In this case, we could achieve even better performance using this implementation:
 
/**
 * Optimized convertor to replace multiple characters strictly with single characters.
 * 
 * @author 0xReinventor
 *
 */
public final class CharacterConvertor
{
   private HashMap<Character, Character> _map = new HashMap<Character, Character>();

   public CharacterConvertor(String oldChars, String newChars)
   {
      super();
      if (newChars.length() < oldChars.length())
      {
         throw new IllegalArgumentException("Not enough 'newChars'.");
      }
      for (int i = 0; i < oldChars.length(); i += 1)
      {
         if (_map.put(oldChars.charAt(i), newChars.charAt(i)) != null)
         {
            throw new IllegalStateException("Character already mapped: "
                  + oldChars.charAt(i));
         }
      }
   }

   public String replaceAll(String str)
   {
      Character mapped = null;
      int i = -1;
      while (++i < str.length())
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            break;
         }
      }
      if (mapped == null)
      {
         return str;
      }
      char[] buff = str.toCharArray();
      buff[i] = mapped;
      while (++i < buff.length)
      {
         mapped = _map.get(buff[i]);
         if (mapped != null)
         {
            buff[i] = mapped;
         }
      }
      return String.valueOf(buff);
   }
}
In this last implementation "replaceAll()" performed at least 20% faster, in my tests, than the version in the more general StringConvertor class (using the same test data). I was also surprised to find out that -in converting upper case English text to lower case- it was a bit slower but comparable to the the built-in "toLowerCase()" method.

No comments:

Post a Comment