IndexOf
To find any potential issues, write a program that searches the source code for instances of "IndexOf", using erm... IndexOf.
Changes to the way string comparisons work in the soon-to-be-released .NET 5.0 may break existing code on Windows. The issue came to light when developer Jimmy Bogard was upgrading a library to support .NET 5.0 and noticed a test failing. His code could find a string within another string using the Contains method – which …
Whilst making breaking changes is never a good thing, it's not like Visual Studio doesn't tell you to specify a culture, and the documentation does tell you that it defaults to StringComparison.CurrentCulture. If you're not controlling the comparison culture, on your own head be it.
They are doing different things though.
I'm not making a judgement on whether what MS have done here is right or wrong, but if IndexOf() is supposed to ignore culture-specific continuation characters (soft hyphens, for instance, so that the position of "Alice" in the two strings "Alice sees Bob" and "Al-ice sees Bob" is reported as the same) and Contains() is not, then the behaviour is arguably correct. It depends on how the correct behaviour is specified. And I'll bet you it is. Ask Jon Skeet...
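For what it's worth, the soft-hyphen case is easy to poke at in Python (whose own search is strictly ordinal, so this shows the disagreement rather than the ICU behaviour):

```python
import unicodedata

# "Al\u00adice" is "Alice" with a soft hyphen (U+00AD) after "Al".
plain = "Alice sees Bob"
hyphenated = "Al\u00adice sees Bob"

# Unicode classifies the soft hyphen as a format character ("Cf"),
# which a culture-sensitive (ICU-style) search may treat as ignorable.
print(unicodedata.category("\u00ad"))  # Cf

# Python's `in`/`find` are ordinal (code point by code point), so the
# soft hyphen is NOT ignored and the two searches disagree:
print("Alice" in plain)          # True
print("Alice" in hyphenated)     # False
print(hyphenated.find("Alice"))  # -1
```

So whether the two strings "contain" Alice depends entirely on whether your search honours ignorable characters, which is precisely the ordinal-versus-culture distinction at issue.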
I'm not disagreeing with you; they probably should both use the same comparison culture by default. But they don't, and the language specification says so.
To be a purist, there is actually no need for the language to provide both a Contains and an IndexOf function, as one can be implemented in terms of the other. The moment you start providing functions that could be implemented using other functions, you run into the risk that their behaviour may diverge. It's a balancing act between providing easy-to-use functionality, and making the language a memory test where, to check whether a string contains another string, you have to remember to check whether the index of that second string in the first one is greater than -1.
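A rough sketch of the purist position, in Python rather than C# (hypothetical index_of/contains names, not the .NET implementation):

```python
# Hypothetical sketch: define contains() in terms of index_of(), so the
# two share one set of search semantics and can never diverge.
def index_of(haystack, needle):
    # Single source of truth for how searching works (ordinal here).
    return haystack.find(needle)

def contains(haystack, needle):
    # "Found" means an index greater than -1, not greater than zero --
    # index 0 is a perfectly good match at the start of the string.
    return index_of(haystack, needle) > -1

print(contains("Alice sees Bob", "Alice"))  # True (match at index 0)
print(index_of("Alice sees Bob", "Alice"))  # 0
```

The .NET situation is the opposite: Contains and IndexOf each have their own default comparison, which is how they came to disagree.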
I'd be interested to see what IL the two different approaches would produce: Contains() vs IndexOf() > -1. I wouldn't be surprised if the only difference is the comparison culture that gets recorded.
For the simple, built-in string operations, StringComparison.CurrentCulture is a horrible default. The default should have been StringComparison.JustDoABinaryComparisonAndDontTryToBeClever, then people who don't understand about the fancy comparison rules can get what they expect. And people who do understand about the fancy comparison rules and want to use them, could pass in StringComparison.CurrentCulture or whatever to opt in to fancy comparisons.
Cultures are a really complicated thing. Programmers regularly do things like assuming "i" (lowercase dotted letter i) and "I" (uppercase undotted letter i) are case-insensitively the same, which makes their programs not work in Turkey. (In Turkish, dotted i and undotted i are different letters; there are lowercase and uppercase forms of each).
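A quick Python illustration of the Turkish-I trap (Python's case mapping is locale-independent, which is exactly how the conflation happens):

```python
# In most locales, "i" (U+0069) and "I" (U+0049) are a case pair.
# Turkish has FOUR i's: dotted i/İ and dotless ı/I are distinct letters.
dotless_lower = "\u0131"  # ı  LATIN SMALL LETTER DOTLESS I
dotted_upper = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# A locale-independent case mapping pairs "i" with "I" ...
assert "i".upper() == "I"

# ... but uppercasing the Turkish dotless ı ALSO yields "I", so a naive
# case-insensitive compare conflates two different Turkish letters:
assert dotless_lower.upper() == "I"

# And lowercasing İ yields "i" plus a combining dot above (U+0307):
print([hex(ord(c)) for c in dotted_upper.lower()])
```

Which is why "just compare case-insensitively" quietly breaks for Turkish users unless the comparison knows which culture it's in.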
Agreed. It is a proper ball-ache to not default to something invariant across cultures. Like a byte-by-byte comparison. Let's not get into big-endian vs little-endian architectures, and pretend that problem doesn't exist too.
It probably says something about where the original language designers at Microsoft were coming from. I should think the expectation was that string searching would be used more often for processing chunks of actual language, rather than being interested in exact string equality. I'm with you on this not necessarily being the best default.
It was best practice back when I did my MCPD in .NET 2.0 (here be dragons) to *explicitly* specify both the culture for certain methods (often just Invariant, or using e.g. ToUpperInvariant), and the comparison method, because if someone buggered about with the Culture settings in Windows, your application could end up behaving differently.
Not a lot of sympathy if people have been too lazy to do it or get a linter to do it for them to be honest... They're the same people that don't pay attention to the collation settings on databases then wonder why queries don't behave as expected on some deployments...
What if they appreciate more culturally sensitive string comparisons?
Anyway, seems like an anti-pattern to use two different functions to do one job. Why not just (pseudocode, obvs):
loc = indexOf(str, search);
if (loc != -1) {
    etc.
} else {
    ...
}
Dunno. Contains presumably returns false, and indexOf, -1.
Probably no better/worse than a "Contains" followed by an "indexOf". It's still weird to do two steps where one will suffice. 'Contains' only makes sense if you don't care where and just use it on its own. That way, the conflict between the two should never really arise. You might get some memory cache hit benefits in the second call to indexOf, but I'd bet the compiler would have a hard time optimising away the duplicate tests. String comparisons can be expensive, and then there's the (small) hit from another API function call/parameter pass.
I can see the argument for API stability, but we'll never get anywhere if these things are set in cement. Particularly when the cross-platform consistency benefits of using a standard package like ICU are on offer.
If str is null?
In the latest language specifications for C# (with nullable reference types enabled), the compiler will flag it, by default as a warning (or an error, if you build with warnings as errors), either for failing to check whether a nullable is null, or for failing to declare that it isn't nullable. If you declare it as not nullable and try to assign a value to it that can be null? Guess what? Flagged again (unless you explicitly allow it).
If you declare str as nullable and don't assign to it, that line of code won't compile.
If you declare it as non-nullable and don't assign to it, the compiler will flag it as uninitialised (a non-nullable field has to be given a value before the constructor finishes), so it shouldn't end up as null behind your back.
If you declare it as non-nullable and try to assign something to it, that won't compile unless you null-check it (e.g. using the null-coalescing operator ?? to assign a non-null value), or explicitly allow that value to be assigned without null-checking it (using the null-forgiving operator, postfix !). If you use the null-forgiving operator, you are explicitly stating that you know it may be null, and are effectively making a promise not to do anything with that variable that might go bang. If you then do use it where you can't have a null, you will get a Null Reference Exception. Which you deserve. Because you lied to the compiler.
So, the short answer is, if str can be null, the compiler will pull you up on it. If you jump through hoops to silence it, then you'll get an exception. Which is reasonable behaviour.
> If you then do use it where you can't have a null, you will get a Null Reference Exception.
So building bulletproof code requires the traditional two-step shuffle: null check first with a sensible default, then call indexOf(str, search).
Relying on the compiler's all well and good, but strings frequently get nulled out through subsequent assignment/manipulation. Would be nice to know that the indexOf utility handles nulls gracefully.
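Something like this hypothetical wrapper is what that two-step shuffle amounts to (Python sketch; safe_index_of is a made-up name, not a real API):

```python
def safe_index_of(s, search):
    """Null-safe IndexOf: a missing string simply reports 'not found'."""
    # Sensible default instead of blowing up on a null/None input.
    if s is None or search is None:
        return -1
    return s.find(search)

print(safe_index_of(None, "Bob"))             # -1, no exception
print(safe_index_of("Alice sees Bob", "Bob")) # 11
```

Whether "not found" is the right answer for a null input is a design choice, of course; some would rather it fail loudly.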
Can't help but think this is one of those arguments similar to cAse sensitVity. More often than not a right PITA.
Too long since I actually wrote code - but agreed, that's good behaviour. It's essentially a bit of static analysis baked into the "compiler" (inverted commas because it's all JIT compiled to machine code anyway). Always been a proponent of static analysis/"code contracts" personally; the less that fails at runtime c.f. compile time, the better IMO.
Icon because, well, I've rather obviously had a few based on this comment's lack of literacy ...
You don't have to use these functions in the same place; their behaviour should still be consistent. All we know about this dev's code is that tests that used indexOf failed; there's no indication that they had calls to contains followed by similar calls to indexOf.
The answer to inconsistent string comparison functions isn't to pick one and ignore the other, it's to standardise them so they behave the same.
From the 3rd para:
> His code could find a string within another string using the Contains method – which answers the question "is the searched string contained in the target string" – but when he used the IndexOf method to find the location of the string, it returned -1, meaning not found.
Quote from MS engineer, implying this is the scenario of concern:
> It is not right to compare the results of Contains with IndexOf.
2nd last para, another dev:
"...would take it as given that if str.Contains(whatever) succeeds, there is no need to inspect the result from str.IndexOf(whatever) because we were just told it is in there and therefore can be found." That does not seem unreasonable.
But agreed, any string comparison functions should behave consistently. It seems folks are tripping up on the fancy-pants aspects of the system culture supported by indexOf, since contains is, by some argument, broken, or at least feature poor.
> It seems folks are tripping up on the fancy-pants aspects of the system culture supported by indexOf, since contains is, by some argument, broken, or at least feature poor.
The problem here isn't that IndexOf supports cultures, and Contains doesn't, but that they default to subtly different cultures if you don't specify one, and rely on the default. Which is why the IDE warns you when you do that, because usually you should specify a culture, even if it is Ordinal or InvariantCulture. If you're actually interested in treating strings as byte arrays and searching for a series of specific bytes, then cast it to a byte array and do that instead, without worrying about all the messiness of Unicode with its multi-byte character encodings.
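A small Python sketch of why byte-level searching is its own can of worms with multi-byte encodings:

```python
needle = "é"
haystack = "café"

# An ordinal string search works on whole code points:
print(haystack.find(needle))   # 3

# The same text as UTF-8 is multi-byte: "é" is two bytes, C3 A9.
print(needle.encode("utf-8"))  # b'\xc3\xa9'

# Searching the raw bytes still works for a whole character ...
print(haystack.encode("utf-8").find(needle.encode("utf-8")))  # 3

# ... but a byte-level search can also "find" half a character,
# which is the messiness a string-level search shields you from:
print(b"\xc3" in haystack.encode("utf-8"))  # True
```

So "just treat it as bytes" only gives sane answers if you genuinely want byte semantics, not character semantics.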
Thanks again for the detailed clarification. You can probably tell I'm not a .Net coder. I do wonder if there's any good reason why Contains and IndexOf would have different default behaviours; it's almost as if they have different intended use-case scenarios, not just different functions. If not, is this a mis-feature? I note your reply to AC below; again helpful, but Micros~1's thinking philosophy here isn't 100% clear. It would seem helpful if they were consistent about handling default cultures, and I can't see why they would need to have differences. This change may be just a long-belated effort to fix that oversight.
But the test in the article highlights that indexOf returns -1 even though the search string was included in the subject.
Just because it has a new-line before it, doesn't invalidate it as a string.
At least I now understand why string theory is so complicated though.
It's probably, in this case, down to whether your default comparison culture considers CR + LF to be indivisible. It's not an unreasonable thing to be defined as such, to be fair. Line endings in text files are notoriously variable. Some systems use CR + LF, some use just CR, others use just LF, some even use LF + CR. To treat the "newline" as indivisible (e.g. CRLF not CR + LF) would imply that, for instance, checking for the existence of "{LF}Bob" in the string "Alice{CRLF}and{CRLF}Bob" would not match, whilst "{CRLF}Bob" would, unless you specify the comparison culture where the searched string is essentially "Alice{CR}{LF}and{CR}{LF}Bob". The same rules about divisibility probably apply to things like diacritics (you wouldn't search for the acute accent on its own and expect to find a match in a string containing "é" for example).
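The diacritic point is easy to demonstrate with Unicode normalisation in Python (ordinal search, so the accent is only findable on its own when the text happens to be stored decomposed):

```python
import unicodedata

composed = "caf\u00e9"  # "café" with a precomposed é (NFC form)
decomposed = unicodedata.normalize("NFD", composed)  # é -> e + U+0301

# The same text to a human, but different code-point sequences:
assert composed != decomposed
print(len(composed), len(decomposed))  # 4 5

# An ordinal search for the combining acute accent on its own
# only matches the decomposed form:
print("\u0301" in composed)    # False
print("\u0301" in decomposed)  # True

# A Unicode-aware comparison would normalise first:
assert unicodedata.normalize("NFC", decomposed) == composed
```

Which is the same shape of problem as the CRLF question: whether a multi-unit sequence counts as one thing or two depends on the comparison rules, not on the bytes.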
The specification for string.Contains, where a comparison culture is not specified, says:
> This method performs an ordinal (case-sensitive and culture-insensitive) comparison.
whereas, the documentation for string.IndexOf says:
> This method performs a word (case-sensitive and culture-sensitive) search using the *current culture*. (my emphasis).
Whether it is right or not for it to be specified this way is another question, but you can't argue it isn't behaving according to spec!
{CR} and {LF} are not a culture issue, but a platform (System) issue (iOS, Linux, Windows, ...), as you say yourself.
A search should not fail to find "\nBob" in a string "Alice\r\nBob" for any culture chosen. Culture can't (shouldn't) define whether it is {CR}{LF} or {CRLF}
I can't imagine any (modern) culture which wouldn't want to achieve the same thing with {CR}{LF} as with just {LF}.
The days of separate actuators on the typewriter are long gone. While technically the CRLF combo is correct (we want to both home the carriage and feed the paper through by one line), those functions have been combined into the "Enter" key so 1 byte should be enough to infer meaning.
Unless you're saying that the {CR} portion should home the carriage at the other end for RTL cultures? In which case they're clearly separate operations and in that case should be treated as such.
CRLF vs LF vs LFCR - let the religious wars begin.
Heretic here - I think CR/LF is spot-on-right semantically, Un*x / MacOS is wrong. Horrible admission, but there we go. Maybe because once upon a time I used a real mechanical typewriter where CR and LF were different physical actions, as replicated in VT{52,1xx} and thereby into history..
Icon because, well, DOS^h^h^h Windows got it right IMO
CRLF is correct ... for a TTy.
Sadly, TTy/VT is the lowest common denominator, ergo CRLF is correct regardless of the 1-byte (semantic) excess. Seems a bit ironic to refer to {CR}/{LF} as "clearly separate operations" that should be treated as such for RTL, which is the reason they're separate operations^h^h^h TTy codes in Windows.
Speaking as an old fart who *used* mechanical typewriters, MS got this one right. CR and LF are different. VTxxx mimics the mechanical operation, ipso facto *either* CR/LF or LF/CR are just fine but merely LF is not. Oh, and please remember line-oriented hard-copy output, e.g. DM printer* .... (I know, it's archaic, but it still *works* and some of us don't like wasting money on shiny)
* Citizen 120D - amazingly robust kit
Truer words, etc. Not for nothing does the second I in ASCII stand for Interchange. Code Pages were display oriented. Unicode has at least tried to deal with lexical ordering and code point equality (not always successfully, mind you; there's a fair bit of early unfortunate decision making that still bites). Composed vs. uncomposed, for example. Multiple semantically equivalent CPs.
I suspect and hope that this functionality is used more for fuzzy logic than anything business critical. For example, if a feedback form contains the word 'complain' it gets routed automatically to /dev/null the complaints department, whereas if it contains the word 'fabulous' it goes to marketing.
I see this all the time, especially from MSFT!!!
An API call should NEVER change its behavior. If you want to change how it behaves in order to enhance features etc... then give it a different name !!!
And don't get me started on Python 2 vs Python 3 !!!!
Indeed. If you're writing code, and need to do a string operation, do you want to have to go and look up which version of the built-in function you should be using?
It's a bit naughty of MS to change the behaviour of an existing function, yes, but languages evolve, and with C#, changes aren't made on a whim, but usually as a result of feedback from the community and a fair amount of debate.
IBM managed to have some APIs change behaviour on the AS/400 without breaking existing code. They did so by adding some optional parameters. If those parameters weren't used or contained default values, the behaviour remained exactly like it was before, thus not breaking existing code. If those parameters were used with non-default values (so new code), the behaviour differed in completely predictable and consistent ways.
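That AS/400 pattern sketches out like this (hypothetical Python; the ignore_case flag and .lower() folding are illustrative stand-ins, not a real API):

```python
# Hypothetical API evolved the AS/400 way: a new optional parameter
# whose default value preserves the original behaviour exactly.
def index_of(haystack, needle, ignore_case=False):
    if ignore_case:
        # New, strictly opt-in behaviour. (.lower() is a crude stand-in
        # for proper case folding -- see the Turkish-I discussion above.)
        return haystack.lower().find(needle.lower())
    # Old callers pass two arguments and get the original search untouched.
    return haystack.find(needle)

print(index_of("Alice sees Bob", "bob"))                    # -1, as ever
print(index_of("Alice sees Bob", "bob", ignore_case=True))  # 11
```

Old code keeps compiling and keeps behaving identically; only code written against the new parameter sees the new behaviour. That's the opposite of silently changing a default.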
This isn't the first time (and won't be the last time) that string comparisons cause bugs. You have to read the documentation really carefully... otherwise you'll just be getting confused about which whitespace character matches which other one, etc.
...or just restrict your strings to ASCII and the language to English. That works too.