What is the index of an empty string in an empty string?

fasterik · on Dec 14, 2023

The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y". The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set. Therefore the empty set is a subset of every set.

A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.

This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.

adrian_b · on Dec 14, 2023

While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".

In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.

If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.

Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.

fasterik · on Dec 14, 2023

Yeah, I think the convention should be to return the smallest valid index such that the predicate is satisfied, or an error value if the predicate isn't satisfied. As you point out, it's the same as when a non-empty string appears multiple times in a given string. So I think 0 is the only reasonable result in this case, assuming 0-based indexing.

hermitcrab · on Dec 15, 2023

>While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).

The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.

>Attempting to search an empty string for non-empty or empty substrings should always signal an error

I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.

tzs · on Dec 14, 2023

> The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y".

This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?

> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.

In what follows I'm replacing lower case x with lower case e.

I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.

This is clearer to me:

1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.

2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".

3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.

bryancoxwell · on Dec 14, 2023

What reason could there possibly be for looking for an empty string?

tharkun__ · on Dec 14, 2023

Because the string you are trying to find is in some variable which at runtime in some cases for some reason happens to be an empty string and you don't (want to) special case this and just "let it work by itself"?

fasterik · on Dec 14, 2023

There are many cases where a user-provided string is passed to the string API. Generally the programmer should handle empty strings as a special case, but when they don't the API needs to return something.

bryancoxwell · on Dec 14, 2023

Thanks!

hermitcrab · on Dec 15, 2023

Is it a sensible thing to do: No.

Will users do it anyway: Yes.

We have to handle all possible values the user can provide.

ajuc · on Dec 15, 2023

Avoiding the need to special-case empty input.

ajuc · on Dec 14, 2023

The invariant is:

     haystack.substring(haystack.indexOf(needle), needle.length()) == needle

And indexOf should return lowest such number.

So it should return 0. LastIndexOf too, because the haystack is empty.

For nonempty haystack it should be

    "Abc".indexOf("") == 0
    "Abc".lastIndexOf("") == 3

l0b0 · on Dec 14, 2023

I'd rather have this invariant:

  haystack[haystack.indexOf(needle)] == needle

That is, there is no index of the empty string within a string. A nice thing about this is that it also works for non-character arrays:

  "abc".indexOf("") raises "No such entry exception"
  ["", "abc", ""].indexOf("") == 0

ajuc · on Dec 15, 2023

This invariant only works for 1-character needles. So no

    "Lorem Ipsum".indexOf("Ipsum")

Which is the main purpose of this function.

Also with your invariant types are different. Not "abc".indexOf("b") but "abc".indexOf('b').

layer8 · on Dec 14, 2023

> there isn’t a first position in the string

This isn’t necessarily true if you think of positions as where a vertical-bar cursor would be, that is, between characters. In

"|"

where the double quotes delimit an empty string, the vertical bar (intended to represent the cursor here) has a position between those delimiters.

hermitcrab · on Dec 14, 2023

That is certainly an approach. But wouldn't that make indexOf() and lastIndexof() different in cases you would intuitively expect them to be the same, e.g:

0-based indexOf "a" in "a" would be 0

0-based lastIndexOf "a" in "a" would be 1

?

layer8 · on Dec 14, 2023

No, because “the index of a string within a string” would be consistently defined to always be the lower index (or, alternatively, always the upper index). There is no intrinsic relation to the search direction, since all characters of the search string have to be compared regardless of the direction. For example, when searching from the end, the search algorithm could start comparing at index `totalLength - searchStringLength`.

To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

hermitcrab · on Dec 14, 2023

>it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

Marinated on this a bit more. I think this the most straightforward and internally consistent way to think about this. Thanks for the insight.

So I figure based on this:

1-based first index of v1 in v2 is:

| v1 | v2 | IndexOf(v1,v2) |

|-------|-------------|----------------|

| [] | [] | 1 |

| aba | [] | N/A |

| [] | aba | 1 |

| a | a | 1 |

| a | aba | 1 |

| x | y | N/A |

| world | hello world | 7 |

Where []=empty.

This is the same as Excel FIND() and differs from Javascript indexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).

1-based last index of v1 in v2 is:

| v1 | v2 | LastIndexOf(v1,v2) |

|-------|-------------|--------------------|

| [] | [] | 1 |

| aba | [] | N/A |

| [] | aba | 4 |

| a | a | 1 |

| a | aba | 3 |

| x | y | N/A |

| world | hello world | 7 |

This differs from Javascript lastIndexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).

jmholla · on Dec 14, 2023

If you add four spaces at the beginning of a line, Hacker News will treat it as preformatted. Here are you tables like such just for ease of reading.

---

1-based first index of v1 in v2 is:

    |  v1   |     v2      | IndexOf(v1,v2) |
    |-------|-------------|----------------|
    | []    | []          | 1              |
    | aba   | []          | N/A            |
    | []    | aba         | 1              |
    | a     | a           | 1              |
    | a     | aba         | 1              |
    | x     | y           | N/A            |
    | world | hello world | 7              |

1-based last index of v1 in v2 is:

    |  v1   |     v2      | LastIndexOf(v1,v2) |
    |-------|-------------|--------------------|
    | []    | []          | 1                  |
    | aba   | []          | N/A                |
    | []    | aba         | 4                  |
    | a     | a           | 1                  |
    | a     | aba         | 3                  |
    | x     | y           | N/A                |
    | world | hello world | 7                  |

hermitcrab · on Dec 15, 2023

Didn't know that. Thanks!

hermitcrab · on Dec 14, 2023

So what would result would you return for:

0-based indexOf "" in "abc"

0-based lastIndexOf "" in "abc"

?

layer8 · on Dec 14, 2023

For the empty string, lastIndexOf would return 3, because the empty string is present at all positions. (When taking the substring (i, i) for any index i, the result is the empty string.) This assumes that the index “after the last character” is considered a valid index for that purpose, which it typically is. (Where can I place the cursor within the delimiters of "abc"? There are four positions where I can place it, 0–3.) Otherwise, 2 would be the last position.

hermitcrab · on Dec 14, 2023

>To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.

I guess that is a good way to look at it. It would mean:

indexof "" in "abca" is 0

indexof "abca" in "" is invalid

indexof "" in "" is 0

Which feels unintuitive.

layer8 · on Dec 14, 2023

If you feel that ‘indexof "" in "" is 0’ is unintuitive, consider that indexOf s in s is zero for all strings s (including for the empty string).

Similarly, indexOf st in s is invalid for all strings s if t is non-empty, that is if s is a proper prefix of the search string. Your example ‘indexof "abca" in "" is invalid’ is just one case of that (with s = empty string and t = "abca"), completely analogous to ‘indexof "babca" in "b" is invalid’.

So I’d say your intuition needs adjusting. ;)

hermitcrab · on Dec 14, 2023

Maybe. ;0)

Your approach seems to be consistent. Unfortunately I think it is too complicated to explain to my non-programmer users.

hurril · on Dec 14, 2023

It's 0 and this fact is sometimes referred to as epsilon when we talk about parsing.

o11c · on Dec 14, 2023

0, obviously. It's the only consistent choice.

This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.

The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).

CoastalCoder · on Dec 14, 2023

To me, this seems similar to asking what should

  int divide(int numerator, int denominator)

do when denominator == 0, or what should a modern version of

  int atoi(const char* s)

do when s == "foo"?

I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.

fwip · on Dec 14, 2023

If I were implementing this in a new language, I'd lean toward making a needle of the empty string throw an error. When the feature request inevitably comes in to make it return a specific value, then I'd have usage information to guide the implementation. :)

arithma · on Dec 14, 2023

The most appropriate behavior is implemented by regex.

Find the empty string and replace with say

  `($&)`

where

  `$&`

refers to the captured variable, and with input

  `xyz`,

the result is

  `()x()y()z()`

With an empty input, the result is

  `()`.

The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).

The limiting case is also an interesting way to look at it:

  ()a()b()c()
  ()a()b()
  ()a()
  ()

cyrusmg · on Dec 14, 2023

What's the logic behind returning anything but -1 (not found) ?

Given "asdf".indexOf("a") is 0

Then "asdf"[0] is "a"

--

Given "".indexOf("") is 0

Then ""[0] is "" (which it's not in any language I know of)

rhn_mk1 · on Dec 14, 2023

"asdf".indexOf("as") is 0

but

"asdf"[0] is NOT "as".

so there can be no expectation that ""[0] is "".

Those two operations are not related to each other. It's more intuitive if you treat .indexOf() as .startOf(). Then "asdf"[0..x] is "as" for x=2, and ""[0..x] is "" for x=0.

CoastalCoder · on Dec 14, 2023

If I were designing this, I'd want to treat this as an unsupported usage of "find".

If you're okay with using the same error code for (a) string not found and (b) erroneous usage, then returning -1 makes sense.

Since (b) is something that probably indicates a bug in the application code, I'd probably prefer that (b) triggers a different program flow. E.g., raising a C++ exception, failing an `assert`, etc.

somewhereoutth · on Dec 14, 2023

But "asdf".indexOf("sd") is 1, so indexOf("") finds the index of the first matching subsequence "", thus 0.

In reverse, the function would be 'return the subsequence starting at index, for some length', which for "" and 0 would indeed be "" for length = 0.

Array indexing [] is unrelated

plagiarist · on Dec 14, 2023

"asdf"[0] is 'a'

"asdf"[0:1] is "a"

and

""[0:0] is ""

robertsdionne · on Dec 14, 2023

The string of all strings that don’t contain themselves as substrings.

onetimeuse92304 · on Dec 14, 2023

I have a simple solution. I just forbid searching for empty strings. And I also assume you can't find anything in an empty string.

We can discuss the philosophy of this for entire day, but at the end of it my job as the person who designs the contract is to make it clear, predictable and usable and I think presenting those special cases does not serve any of these goals.

stenl · on Dec 14, 2023

Related: Dijkstra, ”Why numbering should start at zero”

https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/E...

hermitcrab · on Dec 15, 2023

I am a programmer. I numbers strings, lists and arrays from 0. 'Normal' people don't. I asked my wife what the position of the letter 'a' was in 'abc' and she said '1'. When I suggested it was 0 she looked at me like I was insane. Excel FIND() indexes from 1.

hermitcrab · on Dec 14, 2023

Excel FIND(), Javascript string indexOf()/lastIndexOf() and Qt QString indexOf()/lastIndexOf() all give different answers and some of them aren't even internally consistent.

zwieback · on Dec 14, 2023

It becomes easier to answer if we have two return values: the start and end of the matched range, this is more general but less convenient

gtroja · on Dec 14, 2023

since an empty string is '\0' then the index of the first occurency of an empty string (which is '\0') is 0 ;)

not2b · on Dec 14, 2023

Only in C, other languages don't use null-terminated strings. A C++ std::string can contain a '\0' as the first character but still not be empty (though of course C++ code can also use C strings).

But even if an empty string truly contains no data it will match at the very beginning, so it seems returning 0 is still consistent.

flqn · on Dec 14, 2023

Depends on your string representation. It's not unreasonable for an empty string to have no internal buffer allocated meaning it wouldn't even have a NUL terminator!

deepsun · on Dec 14, 2023

In math logic, any statement made about an element from an empty set is true (as well as negation of that statement).

frou_dh · on Dec 14, 2023

That's what option and result types are for, so you don't need to mess with in-band signalling.

skrebbel · on Dec 14, 2023

That doesn’t solve this problem at all.

Nothing about the article is about the downsides of in-band error reporting. It shows JavaScript’s (terrible) -1 indexOf response in a matter-of-fact way and only talks about the meaning of it, which is “not found”. The question is whether or not to return “not found” for certain inputs, not how to return that value.

flqn · on Dec 14, 2023

Not so much "error reporting" as "there is no meaningful return value" for a position in a string -- if it wasn't found, it doesn't have a position.

If you use either Optional or Result you don't need to choose a special value like -1 to be your "invalid string position" case which could be used incorrectly.

I'd argue it's largely stylistic for this example, especially so outside of functional programming niches.

skrebbel · on Dec 15, 2023

It nevertheless in no way solves the problem that the article is about.

p0w3n3d · on Dec 14, 2023

Why would one compare Qt's QString as a C++ representative? Why not std::string ?

hermitcrab · on Dec 15, 2023

Because the application in question uses QString. I probably should have looked at std::string as well, but it might have just muddied the waters further.

syngrog66 · on Dec 14, 2023

flqn · on Dec 14, 2023

Does every set contain the empty set? Arguably yes.

Does every sequence contain the empty sequence? If so, does it not contain the empty sequence at every position in itself? Harder to reason about.

asplake · on Dec 14, 2023

I would have said the opposite! An empty set is just that: empty. Empty sequences meanwhile appear (kind of, at least) in nondeterministic finite automata

stickfigure · on Dec 14, 2023

FFS, just disallow empty search strings (and document it).

    int String::indexOf(String partial) {
        if (partial.length == 0)
            throw SomeNastyException();
    }

No handwringing necessary.

ok123456 · on Dec 14, 2023

string::npos