Amazing substring behaviour
In a recent code review at my workplace I found a piece of C# code that contained something along these lines:
string foo = "bar";
string substring = foo.Substring(3);
Clearly index position 3 is beyond the end of the string, so I thought I had found a bug and was about to flag the code. Then it occurred to me: why had the unit tests not failed during the gated check-in build?
Referring to the documentation I found that the String.Substring() method does indeed return an empty string when the specified start index is exactly equal to the length of the string.
I’m not new to C# and .NET, so I was quite surprised to find this behaviour in such a basic library function. In a scripting language such as AWK I would not be surprised to find a lax and forgiving API, but in a strongly typed programming language such as C# I expect things to be stricter. Personally I find the behaviour inconsistent and irritating, because other out-of-range index positions such as these
string foo = "bar";
string substring1 = foo.Substring(3, 1);
string substring2 = foo.Substring(4);
both throw an ArgumentOutOfRangeException!
How do other programming languages behave? I fired up the online compiler Ideone in the browser and made a few comparisons…
C#
using System;
public class Test
{
    public static void Main()
    {
        string foo = "bar";
        string substring1 = foo.Substring(3); // ok, empty string
        string substring2 = foo.Substring(3, 0); // ok, empty string
        string substring3 = foo.Substring(3, 1); // ArgumentOutOfRangeException
    }
}
C++
#include <iostream>
#include <string>
int main()
{
    std::string foo = "bar";
    std::string substring1 = foo.substr(3); // ok, empty string
    std::string substring2 = foo.substr(3, 0); // ok, empty string
    std::string substring3 = foo.substr(3, 1); // ok, empty string
    return 0;
}
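The C++ sample above does not try a start index greater than the string length. As far as I can tell from the standard, std::string::substr clamps the count but still throws std::out_of_range once the start index exceeds size(). Here is a minimal sketch to check this; the helper name try_substr and the marker text "out_of_range" are my own inventions for illustration, not part of any library:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a library function): returns the substring,
// or the marker text "out_of_range" if std::string::substr throws.
std::string try_substr(const std::string& s, std::size_t pos,
                       std::size_t count = std::string::npos)
{
    try {
        // substr clamps count to the remaining length, so only pos can be invalid
        return s.substr(pos, count);
    } catch (const std::out_of_range&) {
        return "out_of_range";
    }
}
```

Calling try_substr("bar", 3) and try_substr("bar", 3, 1) should both give an empty string, while try_substr("bar", 4) should give the marker text. So C++ is lenient about the length, but not about the start index.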
Objective-C
#import <Foundation/Foundation.h>
int main()
{
    NSString* foo = @"bar";
    // ok, empty string
    NSString* substring1 = [foo substringFromIndex:3];
    // ok, empty string
    NSString* substring2 = [foo substringWithRange:NSMakeRange(3, 0)];
    // NSRangeException
    NSString* substring3 = [foo substringWithRange:NSMakeRange(3, 1)];
    return 0;
}
Java
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String foo = "bar";
        String substring1 = foo.substring(3); // ok, empty string
        String substring2 = foo.substring(3, 3); // ok, empty string
        String substring3 = foo.substring(3, 4); // StringIndexOutOfBoundsException (an IndexOutOfBoundsException)
    }
}
Conclusion
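Three out of the four languages that I examined behave the same: a start index equal to the string length is accepted as long as the requested length is zero. Only C++, of all things, is more tolerant than the other languages and doesn’t barf even when a length parameter > 0 is specified, because substr silently clamps the length to what is left of the string. A start index greater than the string length still throws std::out_of_range in C++, too.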
And the rationale?
I can only speculate why standard library API designers all over the world should agree that an illegal string index position must be allowed for exactly one border case: when the index position is equal to the string length.
One speculation is that it might be convenient for implementing certain loops that iterate over the content of a string.
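To make that first speculation a bit more concrete, here is a sketch of one such loop: it cuts a string into a prefix and a suffix at every possible position. The function name all_splits is made up for this example. The point is only that the last cut position equals the string length, and the loop needs no special case for it because substr returns an empty suffix there:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: every way to cut s into a (prefix, suffix) pair.
std::vector<std::pair<std::string, std::string>> all_splits(const std::string& s)
{
    std::vector<std::pair<std::string, std::string>> result;
    // Note the <=: the final cut position is s.size(), where the suffix is empty.
    for (std::size_t i = 0; i <= s.size(); ++i) {
        result.emplace_back(s.substr(0, i), s.substr(i));
    }
    return result;
}
```

For "bar" this should yield four pairs, starting with ("", "bar") and ending with ("bar", ""). If the index equal to the length were rejected, the loop would have to treat the last pair separately. Whether this is what the API designers actually had in mind, I cannot say.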
Another speculation is that null-terminated C strings are at work in the background. The C string “foo” looks like this in memory:
characters:       f  o  o  \0
index positions:  0  1  2  3
So one might argue that index position 3 refers to the terminating null byte. But why expose this in the API of a programming language’s standard library when that programming language does not also expose the concept of the string terminating null byte?
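For C++ at least, the terminator is not entirely hidden. As I read the standard, since C++11 operator[] on a std::string accepts an index equal to size() and yields a null character, while at() still throws for the same index. The sketch below checks both; the two helper names are my own and purely illustrative, and none of this says anything about C#, Objective-C or Java:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the character at index s.size().
// Assumes C++11 or later, where operator[] is specified to return
// a reference to a null character for exactly this index.
char char_at_length(const std::string& s)
{
    return s[s.size()];
}

// Hypothetical helper: reports whether the bounds-checked at()
// rejects the very same index with std::out_of_range.
bool at_length_throws(const std::string& s)
{
    try {
        (void)s.at(s.size());
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With "bar", char_at_length should return '\0' and at_length_throws should return true. So even within a single class, the index equal to the length is legal for one accessor and illegal for another.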
It’s a mystery.