Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
What is the index of an empty string in an empty string? (successfulsoftware.net)
13 points by hermitcrab on Dec 14, 2023 | hide | past | favorite | 56 comments


The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y". The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set. Therefore the empty set is a subset of every set.

A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.

This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.


While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".

In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.

If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.

Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.


Yeah, I think the convention should be to return the smallest valid index such that the predicate is satisfied, or an error value if the predicate isn't satisfied. As you point out, it's the same as when a non-empty string appears multiple times in a given string. So I think 0 is the only reasonable result in this case, assuming 0-based indexing.


>While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.

>Attempting to search an empty string for non-empty or empty substrings should always signal an error

I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.


> The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y".

This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?

> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.

In what follows I'm replacing lower case x with lower case e.

I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.

This is clearer to me:

1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.

2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".

3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.


What reason could there possibly be for looking for an empty string?


Because the string you are trying to find is in some variable which at runtime in some cases for some reason happens to be an empty string and you don't (want to) special case this and just "let it work by itself"?


There are many cases where a user-provided string is passed to the string API. Generally the programmer should handle empty strings as a special case, but when they don't the API needs to return something.


Thanks!


Is it a sensible thing to do: No.

Will users do it anyway: Yes.

We have to handle all possible values the user can provide.


Avoiding the need to special-case empty input.


The invariant is:

     haystack.substring(haystack.indexOf(needle), needle.length()) == needle
And indexOf should return lowest such number.

So it should return 0. LastIndexOf too, because the haystack is empty.

For nonempty haystack it should be

    "Abc".indexOf("") == 0
    "Abc".lastIndexOf("") == 3


I'd rather have this invariant:

  haystack[haystack.indexOf(needle)] == needle
That is, there is no index of the empty string within a string. A nice thing about this is that it also works for non-character arrays:

  "abc".indexOf("") raises "No such entry exception"
  ["", "abc", ""].indexOf("") == 0


This invariant only works for 1-character needles. So no

    "Lorem Ipsum".indexOf("Ipsum")
Which is the main purpose of this function.

Also with your invariant types are different. Not "abc".indexOf("b") but "abc".indexOf('b').


> there isn’t a first position in the string

This isn’t necessarily true if you think of positions as where a vertical-bar cursor would be, that is, between characters. In

"|"

where the double quotes delimit an empty string, the vertical bar (intended to represent the cursor here) has a position between those delimiters.


That is certainly an approach. But wouldn't that make indexOf() and lastIndexof() different in cases you would intuitively expect them to be the same, e.g:

0-based indexOf "a" in "a" would be 0

0-based lastIndexOf "a" in "a" would be 1

?


No, because “the index of a string within a string” would be consistently defined to always be the lower index (or, alternatively, always the upper index). There is no intrinsic relation to the search direction, since all characters of the search string have to be compared regardless of the direction. For example, when searching from the end, the search algorithm could start comparing at index `totalLength - searchStringLength`.

To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.


>it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

Marinated on this a bit more. I think this the most straightforward and internally consistent way to think about this. Thanks for the insight.

So I figure based on this:

1-based first index of v1 in v2 is:

| v1 | v2 | IndexOf(v1,v2) |

|-------|-------------|----------------|

| [] | [] | 1 |

| aba | [] | N/A |

| [] | aba | 1 |

| a | a | 1 |

| a | aba | 1 |

| x | y | N/A |

| world | hello world | 7 |

Where []=empty.

This is the same as Excel FIND() and differs from Javascript indexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).

1-based last index of v1 in v2 is:

| v1 | v2 | LastIndexOf(v1,v2) |

|-------|-------------|--------------------|

| [] | [] | 1 |

| aba | [] | N/A |

| [] | aba | 4 |

| a | a | 1 |

| a | aba | 3 |

| x | y | N/A |

| world | hello world | 7 |

This differs from Javascript lastIndexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).


If you add four spaces at the beginning of a line, Hacker News will treat it as preformatted. Here are you tables like such just for ease of reading.

---

1-based first index of v1 in v2 is:

    |  v1   |     v2      | IndexOf(v1,v2) |
    |-------|-------------|----------------|
    | []    | []          | 1              |
    | aba   | []          | N/A            |
    | []    | aba         | 1              |
    | a     | a           | 1              |
    | a     | aba         | 1              |
    | x     | y           | N/A            |
    | world | hello world | 7              |


1-based last index of v1 in v2 is:

    |  v1   |     v2      | LastIndexOf(v1,v2) |
    |-------|-------------|--------------------|
    | []    | []          | 1                  |
    | aba   | []          | N/A                |
    | []    | aba         | 4                  |
    | a     | a           | 1                  |
    | a     | aba         | 3                  |
    | x     | y           | N/A                |
    | world | hello world | 7                  |


Didn't know that. Thanks!


So what would result would you return for:

0-based indexOf "" in "abc"

0-based lastIndexOf "" in "abc"

?


For the empty string, lastIndexOf would return 3, because the empty string is present at all positions. (When taking the substring (i, i) for any index i, the result is the empty string.) This assumes that the index “after the last character” is considered a valid index for that purpose, which it typically is. (Where can I place the cursor within the delimiters of "abc"? There are four positions where I can place it, 0–3.) Otherwise, 2 would be the last position.


>To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

I guess that is a good way to look at it. It would mean:

indexof "" in "abca" is 0

indexof "abca" in "" is invalid

indexof "" in "" is 0

Which feels unintuitive.


If you feel that ‘indexof "" in "" is 0’ is unintuitive, consider that indexOf s in s is zero for all strings s (including for the empty string).

Similarly, indexOf st in s is invalid for all strings s if t is non-empty, that is if s is a proper prefix of the search string. Your example ‘indexof "abca" in "" is invalid’ is just one case of that (with s = empty string and t = "abca"), completely analogous to ‘indexof "babca" in "b" is invalid’.

So I’d say your intuition needs adjusting. ;)


Maybe. ;0)

Your approach seems to be consistent. Unfortunately I think it is too complicated to explain to my non-programmer users.


It's 0 and this fact is sometimes referred to as epsilon when we talk about parsing.


0, obviously. It's the only consistent choice.

This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.

The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).


To me, this seems similar to asking what should

  int divide(int numerator, int denominator)
do when denominator == 0, or what should a modern version of

  int atoi(const char* s)
do when s == "foo"?

I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.


If I were implementing this in a new language, I'd lean toward making a needle of the empty string throw an error. When the feature request inevitably comes in to make it return a specific value, then I'd have usage information to guide the implementation. :)


The most appropriate behavior is implemented by regex.

Find the empty string and replace with say

  `($&)` 
where

  `$&`
refers to the captured variable, and with input

  `xyz`,
the result is

  `()x()y()z()`
With an empty input, the result is

  `()`.
The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).

The limiting case is also an interesting way to look at it:

  ()a()b()c()
  ()a()b()
  ()a()
  ()


What's the logic behind returning anything but -1 (not found) ?

Given "asdf".indexOf("a") is 0

Then "asdf"[0] is "a"

--

Given "".indexOf("") is 0

Then ""[0] is "" (which it's not in any language I know of)


"asdf".indexOf("as") is 0

but

"asdf"[0] is NOT "as".

so there can be no expectation that ""[0] is "".

Those two operations are not related to each other. It's more intuitive if you treat .indexOf() as .startOf(). Then "asdf"[0..x] is "as" for x=2, and ""[0..x] is "" for x=0.


If I were designing this, I'd want to treat this as an unsupported usage of "find".

If you're okay with using the same error code for (a) string not found and (b) erroneous usage, then returning -1 makes sense.

Since (b) is something that probably indicates a bug in the application code, I'd probably prefer that (b) triggers a different program flow. E.g., raising a C++ exception, failing an `assert`, etc.


But "asdf".indexOf("sd") is 1, so indexOf("") finds the index of the first matching subsequence "", thus 0.

In reverse, the function would be 'return the subsequence starting at index, for some length', which for "" and 0 would indeed be "" for length = 0.

Array indexing [] is unrelated


"asdf"[0] is 'a'

"asdf"[0:1] is "a"

and

""[0:0] is ""


The string of all strings that don’t contain themselves as substrings.


I have a simple solution. I just forbid searching for empty strings. And I also assume you can't find anything in an empty string.

We can discuss the philosophy of this for entire day, but at the end of it my job as the person who designs the contract is to make it clear, predictable and usable and I think presenting those special cases does not serve any of these goals.


Related: Dijkstra, ”Why numbering should start at zero”

https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/E...


I am a programmer. I numbers strings, lists and arrays from 0. 'Normal' people don't. I asked my wife what the position of the letter 'a' was in 'abc' and she said '1'. When I suggested it was 0 she looked at me like I was insane. Excel FIND() indexes from 1.


Excel FIND(), Javascript string indexOf()/lastIndexOf() and Qt QString indexOf()/lastIndexOf() all give different answers and some of them aren't even internally consistent.


It becomes easier to answer if we have two return values: the start and end of the matched range, this is more general but less convenient


since an empty string is '\0' then the index of the first occurency of an empty string (which is '\0') is 0 ;)


Only in C, other languages don't use null-terminated strings. A C++ std::string can contain a '\0' as the first character but still not be empty (though of course C++ code can also use C strings).

But even if an empty string truly contains no data it will match at the very beginning, so it seems returning 0 is still consistent.


Depends on your string representation. It's not unreasonable for an empty string to have no internal buffer allocated meaning it wouldn't even have a NUL terminator!


In math logic, any statement made about an element from an empty set is true (as well as negation of that statement).


That's what option and result types are for, so you don't need to mess with in-band signalling.


That doesn’t solve this problem at all.

Nothing about the article is about the downsides of in-band error reporting. It shows JavaScript’s (terrible) -1 indexOf response in a matter-of-fact way and only talks about the meaning of it, which is “not found”. The question is whether or not to return “not found” for certain inputs, not how to return that value.


Not so much "error reporting" as "there is no meaningful return value" for a position in a string -- if it wasn't found, it doesn't have a position.

If you use either Optional or Result you don't need to choose a special value like -1 to be your "invalid string position" case which could be used incorrectly.

I'd argue it's largely stylistic for this example, especially so outside of functional programming niches.


It nevertheless in no way solves the problem that the article is about.


Why would one compare Qt's QString as a C++ representative? Why not std::string ?


Because the application in question uses QString. I probably should have looked at std::string as well, but it might have just muddied the waters further.


0


Does every set contain the empty set? Arguably yes.

Does every sequence contain the empty sequence? If so, does it not contain the empty sequence at every position in itself? Harder to reason about.


I would have said the opposite! An empty set is just that: empty. Empty sequences meanwhile appear (kind of, at least) in nondeterministic finite automata


FFS, just disallow empty search strings (and document it).

    int String::indexOf(String partial) {
        if (partial.length == 0)
            throw SomeNastyException();
    }
No handwringing necessary.


string::npos




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: