The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y". The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set. Therefore the empty set is a subset of every set.
A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.
This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.
While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).
On the other hand, a predicate that tests whether an empty substring is found in a string must indeed return "true".
In a function that must return a single value, an error must be signaled both when there is no value that can be returned and when there are multiple values that could be returned.
If the index of the substring is defined as the smallest such index, which makes it unique even for multiple substring occurrences, then the search function must return 0 or 1 for an empty substring, whichever is defined to be the first index.
Attempting to search an empty string for non-empty or empty substrings should always signal an error, because there are no valid index values that could be returned.
Yeah, I think the convention should be to return the smallest valid index such that the predicate is satisfied, or an error value if the predicate isn't satisfied. As you point out, it's the same as when a non-empty string appears multiple times in a given string. So I think 0 is the only reasonable result in this case, assuming 0-based indexing.
>While an empty substring is a substring of every string, attempting to find its index in any string must signal an error, because there is no univocal value of such an index (all possible index values are equally valid answers).
The operations in question are to find the first (IndexOf) and last (LastIndexOf) indexes. So if the empty exists at every index, you just return the first or last one.
>Attempting to search an empty string for non-empty or empty substrings should always signal an error
I believe that an empty string is valid input for each parameter and should return 'not found' in some form, rather than an error. It is splitting hairs a bit though.
> The subset relation is defined as "X is a subset of Y if for all x in X, x is in Y".
This might be a bit nit picky, but shouldn't that be "if and only if" or "iff" instead of just "if"?
> The negation of this is "X is not a subset of Y if there exists some x in X, such that x is not in Y". We can see from this that the empty set is NOT NOT a subset of every set.
In what follows I'm replacing lower case x with lower case e.
I'm kind of simple minded, so find that a bit confusing because of using the negated definition and a double not, and also "for all e in X" when X is the empty set because it makes you have to keep for the rest of the definition that e might be nonexistent.
This is clearer to me:
1. Rewrite the definition so it is in terms of all e, not all e in X: "X is a subset of Y if and only if for all e, e is in X implies e is in Y". This form of the definition only talks about elements that exist.
2. "e is in X implies e is in Y" is equivalent to "e is not in Y implies e is not in X", which gives us this equivalent definition: "X is a subset of Y if and only if for all e, e is not in Y implies e is not in X".
3. When X is the empty set it is always the case that e is not in X, and so "e is not in Y implies e is not in X" is always true (because both false implies true and true implies true are true), and so X is a subset of Y by the definition of subset.
Because the string you are trying to find is in some variable which at runtime in some cases for some reason happens to be an empty string and you don't (want to) special case this and just "let it work by itself"?
There are many cases where a user-provided string is passed to the string API. Generally the programmer should handle empty strings as a special case, but when they don't the API needs to return something.
That is certainly an approach. But wouldn't that make indexOf() and lastIndexof() different in cases you would intuitively expect them to be the same, e.g:
No, because “the index of a string within a string” would be consistently defined to always be the lower index (or, alternatively, always the upper index). There is no intrinsic relation to the search direction, since all characters of the search string have to be compared regardless of the direction. For example, when searching from the end, the search algorithm could start comparing at index `totalLength - searchStringLength`.
To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.
>it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.
Marinated on this a bit more. I think this the most straightforward and internally consistent way to think about this. Thanks for the insight.
So I figure based on this:
1-based first index of v1 in v2 is:
| v1 | v2 | IndexOf(v1,v2) |
|-------|-------------|----------------|
| [] | [] | 1 |
| aba | [] | N/A |
| [] | aba | 1 |
| a | a | 1 |
| a | aba | 1 |
| x | y | N/A |
| world | hello world | 7 |
Where []=empty.
This is the same as Excel FIND() and differs from Javascript indexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).
1-based last index of v1 in v2 is:
| v1 | v2 | LastIndexOf(v1,v2) |
|-------|-------------|--------------------|
| [] | [] | 1 |
| aba | [] | N/A |
| [] | aba | 4 |
| a | a | 1 |
| a | aba | 3 |
| x | y | N/A |
| world | hello world | 7 |
This differs from Javascript lastIndexOf() (ignoring difference in indexing) only for "".indexOf("") which returns -1 (N/A).
For the empty string, lastIndexOf would return 3, because the empty string is present at all positions. (When taking the substring (i, i) for any index i, the result is the empty string.) This assumes that the index “after the last character” is considered a valid index for that purpose, which it typically is. (Where can I place the cursor within the delimiters of "abc"? There are four positions where I can place it, 0–3.) Otherwise, 2 would be the last position.
>To further illustrate, it would be the index where, if the search string is removed from the found position, it would have to be inserted in order to revert to the original string.
I guess that is a good way to look at it. It would mean:
If you feel that ‘indexof "" in "" is 0’ is unintuitive, consider that indexOf s in s is zero for all strings s (including for the empty string).
Similarly, indexOf st in s is invalid for all strings s if t is non-empty, that is if s is a proper prefix of the search string. Your example ‘indexof "abca" in "" is invalid’ is just one case of that (with s = empty string and t = "abca"), completely analogous to ‘indexof "babca" in "b" is invalid’.
This is only confusing if you're thinking of taking single elements from an array, rather than thinking in terms of slices of arrays. Slicing also makes it obvious why we must use half-open ranges and 0-based indexing.
The only minor gotcha is that, when doing a request for "find all non-overlapping" or "find next strictly after", you have to increment the search index by 1 even though the size of the match is 0. ("find all overlapping" or "find next starting after" always increment by 1 and ignore the size of the match).
do when denominator == 0, or what should a modern version of
int atoi(const char* s)
do when s == "foo"?
I.e., the answer is that the parameter lies outside the domain of well-supported values, so ideally the callee should somehow help the programmer realize they have a programming error.
If I were implementing this in a new language, I'd lean toward making a needle of the empty string throw an error. When the feature request inevitably comes in to make it return a specific value, then I'd have usage information to guide the implementation. :)
The most appropriate behavior is implemented by regex.
Find the empty string and replace with say
`($&)`
where
`$&`
refers to the captured variable, and with input
`xyz`,
the result is
`()x()y()z()`
With an empty input, the result is
`()`.
The result of indexOf can consistently return the first index of the first half open subrange in a string. It just so happens that the first subrange of an empty string in an empty string is [0, 0).
The limiting case is also an interesting way to look at it:
Those two operations are not related to each other. It's more intuitive if you treat .indexOf() as .startOf(). Then "asdf"[0..x] is "as" for x=2, and ""[0..x] is "" for x=0.
If I were designing this, I'd want to treat this as an unsupported usage of "find".
If you're okay with using the same error code for (a) string not found and (b) erroneous usage, then returning -1 makes sense.
Since (b) is something that probably indicates a bug in the application code, I'd probably prefer that (b) triggers a different program flow. E.g., raising a C++ exception, failing an `assert`, etc.
I have a simple solution. I just forbid searching for empty strings. And I also assume you can't find anything in an empty string.
We can discuss the philosophy of this for entire day, but at the end of it my job as the person who designs the contract is to make it clear, predictable and usable and I think presenting those special cases does not serve any of these goals.
I am a programmer. I numbers strings, lists and arrays from 0. 'Normal' people don't. I asked my wife what the position of the letter 'a' was in 'abc' and she said '1'. When I suggested it was 0 she looked at me like I was insane. Excel FIND() indexes from 1.
Excel FIND(), Javascript string indexOf()/lastIndexOf() and Qt QString indexOf()/lastIndexOf() all give different answers and some of them aren't even internally consistent.
Only in C, other languages don't use null-terminated strings. A C++ std::string can contain a '\0' as the first character but still not be empty (though of course C++ code can also use C strings).
But even if an empty string truly contains no data it will match at the very beginning, so it seems returning 0 is still consistent.
Depends on your string representation. It's not unreasonable for an empty string to have no internal buffer allocated meaning it wouldn't even have a NUL terminator!
Nothing about the article is about the downsides of in-band error reporting. It shows JavaScript’s (terrible) -1 indexOf response in a matter-of-fact way and only talks about the meaning of it, which is “not found”. The question is whether or not to return “not found” for certain inputs, not how to return that value.
Not so much "error reporting" as "there is no meaningful return value" for a position in a string -- if it wasn't found, it doesn't have a position.
If you use either Optional or Result you don't need to choose a special value like -1 to be your "invalid string position" case which could be used incorrectly.
I'd argue it's largely stylistic for this example, especially so outside of functional programming niches.
Because the application in question uses QString. I probably should have looked at std::string as well, but it might have just muddied the waters further.
I would have said the opposite! An empty set is just that: empty. Empty sequences meanwhile appear (kind of, at least) in nondeterministic finite automata
A similar argument should show that the empty string is a substring of every string. Therefore indexOf(substring, string) should return a non-negative index if substring is the empty string.
This reminds me of a discussion on HN a while ago about predicates over the empty set. There was debate over whether "all(x == "foo" for x in my_array)" should evaluate to True or False if my_array is empty. I tried to point out that if it evaluated to False, it would break first order logic in a similar way as I explained above, but people were still trying to argue the opposite case.