October 8, 2012

How to efficiently replace multiple characters in a Java String

The problem of string transformation and manipulation is so common that every programming language offers either special features about it or, at the very least, a standard library with basic solutions.

In Java there's both regular expression matching /replacement via its standard library, and various “replace” operations that may be directly called on a String object. The replace operations return a new String, as in Java there can be no in-place replacement of String contents; String objects are immutable.

One problem that is common enough, in my opinion, to justify the development of a special method to handle it is this: replacing multiple characters in a String at once (in one go).

There are three kinds of replacements that we want to handle:
  1. Replacing a character with another single character. This kind of replacement does not affect the length of the original string.
  2. Replacing a single character with a sequence of characters (a string). This kind of replacement will increase the length of the original string.
  3. Replacing a character with “nothing” i.e. removing the character from the string. This kind of operation will reduce the length of the original string.
We will use a Map<Character,Object> to associate the characters that are going to be replaced (keys in the Map) with the respective replacements. The replacement objects (values in the Map) will be Character objects for case number 1, non empty strings for case number two, and the empty string for case number 3.
 
/**
 * Helper class to facilitate replacing multiple characters in a String.
 * 
 * @author 0xReinventor
 */
public final class StringConvertor
{
   private HashMap<Character, Object> _map = new HashMap<Character, Object>();

   /**
    * Adds a character that will be replaced by a non-empty string.
    * 
    * @param oldChar
    * @param newString
    * @return
    */
   public StringConvertor put(char oldChar, String newString)
   {
      _map.put(oldChar, newString);
      return this;
   }

   /**
    * Adds characters that will be replaced by other single characters.
    * 
    * @param oldChars
    *           The characters in this string will be replaced by the respective
    *           character in the newChars argument (i.e. by the character at the
    *           same index).
    * @param newChars
    * @return
    */
   public StringConvertor putCharsToReplace(String oldChars, String newChars)
   {
      if (newChars.length() < oldChars.length())
      {
         throw new IllegalArgumentException("Not enough 'newChars'.");
      }
      for (int i = 0; i < oldChars.length(); i += 1)
      {
         _map.put(oldChars.charAt(i), newChars.charAt(i));
      }
      return this;
   }

   /**
    * Adds characters that will be removed from the string.
    * 
    * @param chars
    * @return
    */
   public StringConvertor putCharsToDelete(String chars)
   {
      for (int i = 0; i < chars.length(); i += 1)
      {
         _map.put(chars.charAt(i), "");
      }
      return this;
   }

   public String replaceAll(String str)
   {
      // TODO replace all characters in str that are mentioned in the _map as keys
      // with the associated values found in the _map.
      return null;
   }
}

As an example, if we wanted to to convert “<” and “>” to HTML entities, we could use this object like this:
StringConvertor conv = new StringConvertor().put('<', "&lt;").put('>', "&gt;");
System.out.println(conv.replaceAll(
      "a < b < c => a < c"));
And the output would be:
a &lt; b &lt; c =&gt; a &lt; c
A possible implementation of replaceAll() would be:
   /**
    * Replaces all occurrences of characters in str that match the characters in
    * the StringConvertor object with the associated replacements.
    * 
    * @param str
    * @return
    */
   public String replaceAll(String str)
   {
      Object mapped = null;
      int i = -1;
      while (++i < str.length()) // Skip characters that do not need to be
                                 // replaced.
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            break;
         }
      }
      if (mapped == null) // No need to replace any characters in the string.
      {
         return str;
      }
      StringBuilder dest = new StringBuilder();
      dest.append(str, 0, i).append(mapped);
      int start = -1;
      while (++i < str.length())
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            if (start >= 0) // First append any characters that do not need
                            // conversion.
            {
               dest.append(str, start, i);
               start = -1;
            }
            if (mapped instanceof Character)
            {
               // Make sure to call append(char c) and not append(Object o).
               dest.append(((Character) mapped).charValue());
            }
            else if (((String) mapped).length() > 0) // Skip the call for empty
                                                     // strings.
            {
               dest.append((String) mapped);
            }
         }
         else
         {
            if (start < 0) // Mark the beginning of a segment that does not need
                           // conversion.
            {
               start = i;
            }
         }
      }
      if (start >= 0) // Don't forget to append any pending characters.
      {
         dest.append(str, start, str.length());
      }
      return dest.toString();
   }

The implementation takes care not to duplicate the original string, if there are no characters to be converted. Also, when there are characters to be converted, it will avoid copying characters one-by-one; i.e. the ones that do not need conversion will be copied in chunks as large as possible.

This is the fastest implementation that I could come up with. It may be optimised a bit further, if we only need to cover case 1: replacing characters with single characters (not with strings, and no removal of characters). In this case, we could achieve even better performance using this implementation:
 
/**
 * Optimized convertor to replace multiple characters strictly with single characters.
 * 
 * @author 0xReinventor
 *
 */
public final class CharacterConvertor
{
   private HashMap<Character, Character> _map = new HashMap<Character, Character>();

   public CharacterConvertor(String oldChars, String newChars)
   {
      super();
      if (newChars.length() < oldChars.length())
      {
         throw new IllegalArgumentException("Not enough 'newChars'.");
      }
      for (int i = 0; i < oldChars.length(); i += 1)
      {
         if (_map.put(oldChars.charAt(i), newChars.charAt(i)) != null)
         {
            throw new IllegalStateException("Character already mapped: "
                  + oldChars.charAt(i));
         }
      }
   }

   public String replaceAll(String str)
   {
      Character mapped = null;
      int i = -1;
      while (++i < str.length())
      {
         mapped = _map.get(str.charAt(i));
         if (mapped != null)
         {
            break;
         }
      }
      if (mapped == null)
      {
         return str;
      }
      char[] buff = str.toCharArray();
      buff[i] = mapped;
      while (++i < buff.length)
      {
         mapped = _map.get(buff[i]);
         if (mapped != null)
         {
            buff[i] = mapped;
         }
      }
      return String.valueOf(buff);
   }
}
In this last implementation "replaceAll()" performed at least 20% faster, in my tests, than the version in the more general StringConvertor class (using the same test data). I was also surprised to find out that -in converting upper case English text to lower case- it was a bit slower but comparable to the the built-in "toLowerCase()" method.

September 30, 2012

Observable properties: A means to track state changes in Java objects

An example application of the Observer design pattern using Java


In one of my recent pet projects I needed objects that should be "aware" that one or more of their properties had changed. The obvious way to do it of course is to write code inside the "setter method" of each property and call whatever code must be executed in the event of a change. Something like that:

void setName(String name) {
  this.name = name;
  onNameChanged();
}

void onNameChanged() {
  // do whatever is necessary here
}

But what I really wanted was to attach an "on change" event to some of the object's properties and handle this event in a systematic way. I thought this was a valid opportunity to use the Observer design pattern and have the properties of the object notify their owner that they had changed.

We can assume that each property is not any more a simple object of a given type T, but an instance of an "observable object" that contains a value of type T:

public final class ObservableProperty<T> {

  private PropertyObserver _observer;
  private T _value;

  public T getValue() {
    return _value;
  }

  public void setValue(T newValue) {
    _value = newValue;
    _observer.notify();
  }
}

The observable object must be able to notify the observer by calling its notify() method, so the PropertyObserver object must have a method about like that:

public class PropertyObserver {
  // other methods and fields
  void notify();
  // other methods and fields
}

The above solution, although valid, seems a little spartan. The PropertyObserver may be notified by calls to its notify() method, but it receives no information whatsoever about which property has changed or how it has changed. We should create a more interesting IPropertyObserver interface:

public interface IPropertyObserver {
  void onPropertyChanged(ObservableProperty<?> sender, Object previousValue);
}

With this new implementation, the property observer object can be notified that a property has changed and also it may receive information about which property changed (the “sender” argument) and what was the value of the property before the change (the “previousValue” argument).

To cooperate with IPropertyObserver, the implementation of the ObservableProperty generic class must now change to something like this:

public final class ObservableProperty<T> {

  private final IPropertyObserver _observer;
  private T _value;

  public T getValue() {
    return _value;
  }

  public void setValue(T newValue) {
    if (newValue == _value || (newValue != null && newValue.equals(_value))) {
      return; // The property has not changed
    }
    Object previousValue = _value;
    _value = newValue;
    _observer.onPropertyChanged(this, previousValue); // Notify the observer
                                                      // about the change
  }

  public ObservableProperty(IPropertyObserver observer) {
    super();
    this._observer = observer;
  }
}

This new implementation also checks to see that the internal value of the property has in fact changed, before calling onPropertyChanged().

To try out this simple solution, let's create a simple Contact object; an object to hold the name and contact information of a person:

class Contact implements IPropertyObserver {

  public final ObservableProperty<String> name = new ObservableProperty<String>(this);
  public final ObservableProperty<Date> birthDate = new ObservableProperty<Date>(this);
  public final ObservableProperty<String> mobile = new ObservableProperty<String>(this);
  public final ObservableProperty<String> email = new ObservableProperty<String>(this);

  public void onPropertyChanged(ObservableProperty<?> sender, Object previousValue) {
  }
}

The Contact class, above, has four properties (name, birthDate, email, mobile) that are instances of ObservableProperty. Also, it implements the interface IPropertyObserver, so that it may be notified every time one of its observable properties has changed.

We now have a central place, where we can handle the event of a change in the values of the observable properties. The following code, for instance, will reflect on the object to find the property's name and then display an informative message reporting both the new and the old value of the property:

public void onPropertyChanged(ObservableProperty<?> sender, Object previousValue) {
  try {
    for (Field f : Contact.class.getDeclaredFields()) {
      if (f.get(this) == sender) {
        System.out.println(String.format("%33s %-28s (was: %s)", "'" +
          f.getName() + "' changed to:", sender.getValue(), previousValue));
      }
    }
  }
  catch (RuntimeException e) {
    throw e;
  }
  catch (Exception e) {
    throw new RuntimeException(e);
  }
}

September 27, 2012

Quicksort rewritten in tail-recursive form

An example using the Scala programming language


I am currently learning Scala and I became quite intrigued by the compiler's ability to optimize tail-recursive calls.

For those unfamiliar with the concept, a tail-recursive method can be optimized to be executed in constant stack space. This practically means that the recursion is not really necessary and that the method in question could be implemented as a simple iterative process! Nevertheless, it is nice to have the compiler do the trick of converting the recursion into a loop and let the programmers express their ideas using recursion.

As an exercise, I set out to convert the renown Quicksort algorithm into tail-recursive form. But first, why would this algorithm need any conversion whatsoever?

Let us see an implementation of the algorithm written in Scala. I am using the following piece of code as it appears in “Scala By Example” by Martin Odersky (draft of May 24, 2011). It is a straightforward implementation of Quicksort for arrays of integers:

  def quickSort_scalaByExample(xs: Array[Int]) {
    def swap(i: Int, j: Int) {
      val t = xs(i); xs(i) = xs(j); xs(j) = t
    }
    def sort1(l: Int, r: Int) {
      val pivot = xs((l + r) / 2)
      var i = l
      var j = r
      while (i <= j) {
        while (xs(i) < pivot) i += 1
        while (xs(j) > pivot) j -= 1
        if (i <= j) {
          swap(i, j)
          i += 1
          j -= 1
        }
      }
      if (l < j) sort1(l, j)
      if (i < r) sort1(i, r)
    }
    sort1(0, xs.length - 1)
  }

The recursive method sort1() may call itself up to two times at the end of each execution. This is the reason why the method is not tail-recursive. Indeed, if we use the @tailrec annotation in the definition of sort1(), the compiler will complain with this error:
could not optimize @tailrec annotated method sort1: it contains a recursive call not in tail position
The message refers to the line if (l < j) sort1(l, j).

To enable the optimization, we need to make sure that the recursive call may happen at most once at the end of the execution of sort1().

First, let's rewrite the algorithm so that the “sorting” part of it is factored out of the recursion. This is the part where elements are being compared against each other and swapped inside the array, so that the pivot element gets its “correct” position in the array (for an explanation of Quicksort you may read this Wikipedia article). We introduce a new method sortRange(), which will take a lower and upper bound and perform the “sorting work” on this segment of the array:

  def quickSort_intermediate(xs: Array[Int]) {
    def swap(i: Int, j: Int) {
      val t = xs(i); xs(i) = xs(j); xs(j) = t
    }
    def sortRange(l: Int, r: Int): (Int, Int) = {
      val pivot = xs((l + r) / 2)
      var i = l
      var j = r
      while (i <= j) {
        while (xs(i) < pivot) i += 1
        while (xs(j) > pivot) j -= 1
        if (i <= j) {
          swap(i, j)
          i += 1
          j -= 1
        }
      }
      return (i, j)
    }
    def sort1(l: Int, r: Int) {
      val i_j = sortRange(l, r);
      val i = i_j._1
      val j = i_j._2
      if (l < j) sort1(l, j)
      if (i < r) sort1(i, r)
    }
    sort1(0, xs.length - 1)
  }

This new method does the sorting and then returns the values of “i” and “j” as a tuple, because these are needed to continue the process. The sort1() method still takes the same arguments as before and is in fact, again, not at all tail-recursive! Perhaps annoyingly, it does not even make use of Scala's idiomatic syntax for tuples, so we rewrite it to this more natural Scala form:

    def sort1(l: Int, r: Int) {
      sortRange(l, r) match {
        case (i, j) => {
          if (l < j) sort1(l, j)
          if (i < r) sort1(i, r)
        }
      }
    }

So, up to this point, we have broken down the initial implementation into a method “sortRange()” that can “do the sorting work” on segments of the array, and a recursive method “sort1()” that calls itself up to two times in every execution, using as arguments new segments of the array.

The recursive method sort1() takes as arguments the lower and upper limits of the array where it is going to work on. We are going to replace this method with one that takes as its only argument “a list of array segments” that remain to be sorted. Or, more precisely, a list of tuples containing the lower and upper bounds of the array segments that need to be sorted:

    def sort2(segments: List[(Int, Int)]) {
      segments.head match {
        case (l, r) => {
          var newSegments = segments.tail
          sortRange(l, r) match {
            case (i, j) => {
              if (l < j) newSegments = (l, j) :: newSegments
              if (i < r) newSegments = (i, r) :: newSegments
            }
          }
          if (!newSegments.isEmpty) sort2(newSegments)
        }
      }
    }

What does the new method do?
  • It takes a list of “segments of the array” (i.e. the upper and lower bounds of each segment) and calls sortRange() on the head of the list.
  • Then, according to the results of sortRange(), it may add one or two more segments to the tail of the original list of segments that need to be sorted.
  • Finally, it recursively calls itself at the end, with the new list of segments as an argument.
This version of the code is tail-recursive, and the compiler will certainly not complain if we add the @tailrec annotation. Here's the complete implementation:

  def quickSort_tailRecursive(xs: Array[Int]) {
    def swap(i: Int, j: Int) {
      val t = xs(i); xs(i) = xs(j); xs(j) = t
    }
    def sortRange(l: Int, r: Int): (Int, Int) = {
      val pivot = xs((l + r) / 2)
      var i = l
      var j = r
      while (i <= j) {
        while (xs(i) < pivot) i += 1
        while (xs(j) > pivot) j -= 1
        if (i <= j) {
          swap(i, j)
          i += 1
          j -= 1
        }
      }
      return (i, j)
    }
    @tailrec def sort2(segments: List[(Int, Int)]) {
      segments.head match {
        case (l, r) => {
          var newSegments = segments.tail
          sortRange(l, r) match {
            case (i, j) => {
              if (l < j) newSegments = (l, j) :: newSegments
              if (i < r) newSegments = (i, r) :: newSegments
            }
          }
          if (!newSegments.isEmpty) sort2(newSegments)
        }
      }
    }
    sort2(List((0, xs.length - 1)))
  }

Conclusion


It was not so hard to convert the classic implementation of Quicksort into a from that is eligible for tail-recursion optimization using the Scala language. We could go ahead and convert this form into an iterative process even -but why not let the compiler do this step?

The question that arises naturally is: Is this implementation more efficient or better in any way than the classic implementation? I would answer yes AND no.

In the process of removing the second recursive call from the original code (from method sort1()), we introduced a method that needs an argument that is a list of things, while in the original form we had simple arguments of constant size. This list expands in size over time, using space in the heap. So, in effect, we traded heap space for stack space.

Arguably this is a good thing; as far as I know most machines (and this includes the Java virtual machine on its default settings) have a more limited stack space compared to their (sometimes virtually unlimited) heap space. With this in mind I believe that -under normal circumstances- the tail-recursive implementation would be able to operate on larger arrays than the classic method, without being prone to a stack overflow error.