back to article Devs strung up about .NET 5.0 string changes that may break working code are told: It's not a bug, it's a feature

Changes to the way string comparisons work in the soon-to-be-released .NET 5.0 may break existing code on Windows. The issue came to light when developer Jimmy Bogard was upgrading a library to support .NET 5.0 and noticed a test failing. His code could find a string within another string using the Contains method – which …

  1. Andy Non Silver badge
    Joke

    IndexOf

    To find any potential issues, write a program that searches the source code for instances of "IndexOf", using erm... IndexOf.

    1. b0llchit Silver badge
      Trollface

      Re: IndexOf

      $ for i in *.cs; do sed -i -e's/IndexOf/FixMSCrap_IndexOf/g' $i; done

      Maybe we should do a subst /sed/rm/. That may be a too radical fix. Next upgrade will do that automatically for you, probably when the neural net found a better program to replace the developers. (/s)

  2. Loyal Commenter Silver badge

    To be fair to Microsoft

    Whilst making breaking changes is never a good thing, it's not like Visual Studio doesn't tell you to specify a culture, and the documentation does tell you that it defaults to StringComparison.CurrentCulture. If you're not controlling the comparison culture, on your own head be it.

    1. sabroni Silver badge
      Stop

      To be fair to everyone else

      The comparison mechanism and defaults for contains and indexOf should be the same. There's no explanation why one is culture sensitive and one isn't.

      There's no value in them behaving differently, it's inconsistent for no good reason.

      1. J27

        Re: To be fair to everyone else

        It's pretty inconsistent that these two string methods, on the same object, do not work the same way.

        1. Loyal Commenter Silver badge

          Re: To be fair to everyone else

          They are doing different things though.

          I'm not making a judgement on whether what MS have done here is right or wrong, but if IndexOf() is supposed to ignore culture specific continuation characters (such as soft hyphens for instance, so that the position of "Alice" in the two strings "Alice sees Bob" and "Al-ice sees Bob", is reported as the same) and Contains is not then the behaviour is arguably correct. It depends on how the correct behaviour is specified. And I'll bet you it is. Ask Jon Skeet...

          1. sabroni Silver badge

            Re: They are doing different things though.

            The point being made is that they should do the same thing. There's no reason given why only one is culture sensitive. It's a language, not a memory test.

            1. Loyal Commenter Silver badge

              Re: They are doing different things though.

              I'm not disagreeing with you; they probably should both use the same comparison culture by default. But they don't, and the language specification says so.

              To be a purist, there is actually no need for the language to provide both a Contains and an IndexOf function, as one can be implemented as a function of the other. The moment you start providing functions that can be implemented using other functions, you run into the issue that the behaviour may diverge. It's a balancing act between providing easy to use functionality, and making the language a memory test where you have to remember that to check if a string contains another string you have to check to see if the index of that second string in the first one is greater than zero.

              I'd be interested to see what bytecode the two different approaches would produce; Contains() vs IndexOf() > -1. I wouldn't be surprised if the only difference is the comparison culture that gets recorded.

      2. razorfishsl

        Re: To be fair to everyone else

        MS going WOKE......

        The strings have aright to be acknowledged as having differences defined by themselves......

    2. Jon 37

      Re: To be fair to Microsoft

      For the simple, built-in string operations, StringComparison.CurrentCulture is a horrible default. The default should have been StringComparison.JustDoABinaryComparisonAndDontTryToBeClever, then people who don't understand about the fancy comparison rules can get what they expect. And people who do understand about the fancy comparison rules and want to use them, could pass in StringComparison.CurrentCulture or whatever to opt in to fancy comparisons.

      Cultures are a really complicated thing. Programmers regularly do things like assuming "i" (lowercase dotted letter i) and "I" (uppercase undotted letter i) are case-insensitively the same, which makes their programs not work in Turkey. (In Turkish, dotted i and undotted i are different letters; there are lowercase and uppercase forms of each).

      1. Loyal Commenter Silver badge

        Re: To be fair to Microsoft

        Agreed. It is a proper ball-ache to not default to something invariant across cultures. Like a byte-by-byte comparison. Let's not get into big-endian vs little-endian architectures, and pretend that problem doesn't exist too.

        It probably says something about where the original language designers at Microsoft were coming from. I should think the expectation was that string searching would be used more often for processing chunks of actual language, rather than being interested in exact string equality. I'm with you on this not necessarily being the best default.

    3. Stu J

      Re: To be fair to Microsoft

      It was best practice back when I did my MCPD in .NET 2.0 (here be dragons) to *explicitly* specify both the culture for certain methods (often just Invariant, or using e.g. ToUpperInvariant), and the comparison method, because if someone buggered about with the Culture settings in Windows, your application could end up behaving differently.

      Not a lot of sympathy if people have been too lazy to do it or get a linter to do it for them to be honest... They're the same people that don't pay attention to the collation settings on databases then wonder why queries don't behave as expected on some deployments...

  3. sbt
    Headmaster

    "... that doesn't create new productivity improvements for our users"

    What if they appreciate more culturally sensitive string comparisons?

    Anyway, seems like an anti-pattern to use two different functions to do one job. Why not just (pseudocode, obvs):

    loc = indexOf (str, search);

    If (loc != -1) {

    etc.

    } else {

    ...

    }

    1. Wibble

      Re: "... that doesn't create new productivity improvements for our users"

      If str is null?

      1. sbt
        Happy

        If str is null?

        Dunno. Contains presumably returns false, and indexOf, -1.

        Probably no better/worse than a "Contains" followed by an "indexOf". It's still weird to do two steps where one will suffice. 'Contains' only makes sense if you don't care where and just use it on its own. That way, the conflict between the two should never really arise. You might get some memory cache hit benefits in the second call to indexOf, but I'd bet the compiler would have a hard time optimising away the duplicate tests. String comparisons can be expensive, and then there's the (small) hit from another API function call/parameter pass.

        I can see the argument for API stability, but we'll never get anywhere if these things are kept in cement. Particularly when the cross-platform consistency benefits of using a standard pkg like ICU are on offer.

      2. Loyal Commenter Silver badge

        Re: "... that doesn't create new productivity improvements for our users"

        If str is null?

        In the latest language specifications for C#, it will be (by default) a compile error for either failing to check for whether a nullable is null, or failing to declare that it isn't nullable. If you declare it as not nullable and try to assign a value to it that can be null? Guess what? A compile error (unless you explicitly allow it).

        If you declare str as nullable and don't assign to it, that line of code won't compile.

        If you declare it as non-nullable and don't assign to it, it will default to having string.Empty as its value so won't be null.

        If you declare it as non-nullable and try to assign something to it, that won't compile unless you null-check it (e.g. using the null-coalescing operator, ?? to assign a non-null value), or explicitly allow that value to be assigned without null checking it (using the null forgiving operator, postfix !). If you use the null forgiving operator, you are explicitly stating that you know it may be null, and are effectively making a promise to not do anything with that variable that might go bang. If you then do use it where you can't have a null, you will get a Null Reference Exception. Which you deserve. Because you lied to the compiler.

        So, the short answer is, if str is null, your code won't compile. If you jump through hoops to make it compile, then you'll get an exception. Which is reasonable behaviour

        1. sbt
          Pint

          reasonable behaviour

          Thanks, Loyal Commenter, for clearing that up; nice detail.

        2. Wibble

          Re: "... that doesn't create new productivity improvements for our users"

          > If you then do use it where you can't have a null, you will get a Null Reference Exception.

          So building bulletproof code requires the traditional two step shuffle; null check first with sensible default, then call the indexOf(str, search)

          Relying on the compiler's all well and good, but strings frequently get nulled out through subsequent assignment/manipulation. Would be nice to know that the indexOf utility handles nulls gracefully.

          Can't help but think this is one of those arguments similar to cAse sensitVity. More often than not a right PITA.

        3. PeteA
          Pint

          Re: "... that doesn't create new productivity improvements for our users"

          Too long since I actually wrote code - but agreed, that's good behaviour. It's essentially a bit of static-analysis baked into the `compiler` (inverted commas because it's all JIT compiled to MC anyway). Always been a proponent of static analysis/"code contracts" personally, the less that fail at runtime c.f. compile time the better IMO.

          Icon because, well, I've rather obviously had a few based on this comment's lack of literacy ...

    2. sabroni Silver badge

      Re: Anyway, seems like an anti-pattern to use two different functions to do one job

      You don't have to use these functions in the same place, their behaviour should still be consistent. All we know about this dev's code is that tests that used indexOf failed, there's no indication that they had calls to contains followed by similar calls to indexOf.

      The answer to inconsistent string comparison functions isn't to pick one and ignore the other, it's to standardise them so they behave the same.

      1. sbt
        Windows

        there's no indication that they had calls to contains followed by similar calls to indexOf

        From the 3rd para:

        His code could find a string within another string using the Contains method – which answers the question "is the searched string contained in the target string" – but when he used the IndexOf method to find the location of the string, it returned -1, meaning not found.

        Quote from MS engineer, implying this is the scenario of concern:

        It is not right to compare the results of Contains with IndexOf..

        2nd last para, another dev:

        "...would take it as given that if str.Contains(whatever) succeeds, there is no need to inspect the result from str.IndexOf(whatever) because we were just told it is in there and therefore can be found." That does not seem unreasonable.

        But agreed, any string comparison functions should behave consistently. It seems folks are tripping up on the fancy-pants aspects of the system culture supported by indexOf, since contains is, by some argument, broken, or at least feature poor.

        1. sabroni Silver badge
          Thumb Up

          Re: there's no indication that they had calls to contains followed by similar calls to indexOf

          Thanks! Much better comprehension skills than me!

          1. sbt
            Thumb Up

            comprehension

            No worries; it's a confusing scenario and not unreasonable to suggest it was contrived given the original reporter noticed a test failing; didn't down-thumb you on that basis; sorry others were more harsh.

            1. PeteA
              Thumb Up

              Re: comprehension

              Too easy to down-thumb, some of us have been doing this for a while and some haven't. Kudos.

        2. Loyal Commenter Silver badge

          Re: there's no indication that they had calls to contains followed by similar calls to indexOf

          It seems folks are tripping up on the fancy-pants aspects of the system culture supported by indexOf, since contains is, by some argument, broken, or at least feature poor.

          The problem here isn't that IndexOf supports cultures, and Contains doesn't, but that they default to subtly different cultures if you don't specify one, and rely on the default. Which is why the IDE warns you when you do that, because usually you should specify a culture, even if it is Ordinal or InvariantCulture. If you're actually interested in treating strings as byte arrays and searching for a series of specific bytes, then cast it to a byte array and do that instead, without worrying about all the messiness of Unicode with its multi-byte character encodings.

          1. sbt
            Thumb Up

            they default to subtly different cultures if you don't specify one

            Thanks again for the detailed clarification. You can probably tell I'm not a .Net coder. I do wonder if there's any good reason why Contains and IndexOf would have different default behaviours; it's almost as if they have different intended use-case scenarios, not just different functions. If not, is this a mis-feature? I note your reply to AC below; again helpful, but Micros~1's thinking philosophy here isn't 100% clear. It would seem helpful if they were consistent about handling default cultures and I can't see why they would need to have differences. This change maybe just a long belated effort to fix that oversight.

    3. Anonymous Coward Silver badge
      FAIL

      Re: "... that doesn't create new productivity improvements for our users"

      But the test in the article highlights that indexOf returns -1 even though the search string was included in the subject.

      Just because it has a new-line before it, doesn't invalidate it as a string.

      At least I now understand why string theory is so complicated though.

      1. Loyal Commenter Silver badge

        Re: "... that doesn't create new productivity improvements for our users"

        It's probably, in this case, down to whether your default comparison culture considers CR + LF to be indivisible. It's not an unreasonable thing to be defined as such, to be fair. Line endings in text files are notoriously variable. Some systems use CR + LF, some use just CR, others use just LF, some even use LF + CR. To treat the "newline" as indivisible (e.g. CRLF not CR + LF) would imply that, for instance, checking for the existence for "{LF}Bob" in the string "Alice{CRLF}and{CRLF}Bob" would not match, whilst "{CRLF}Bob" would, unless you specify the comparison culture where the searched string is essentially "Alice{CR}{LF}and{CR}{LF}Bob". The same rules about divisibility probably apply to things like diacritics (you woulnd't search for the acute accent on its own and expect to find a match in a string containing "é" for example).

        The specification for string.Contains, where a comparison culture is not specified, says:

        This method performs an ordinal (case-sensitive and culture-insensitive) comparison.

        whereas, the documentation for string.IndexOf says:

        This method performs a word (case-sensitive and culture-sensitive) search using the current culture. (my emphasis).

        Whether it is right or not for it to be specified this way is another question, but you can't argue it isn't behaving according to spec!

        1. the Kris
          Boffin

          Re: "... that doesn't create new productivity improvements for our users"

          {CR} and {LF} are not a culture issue, but a platform (System) issue (iOS, Linux, Windows, ...), as you say yourself.

          A search should not fail to find "\nBob" in a string "Alice\r\nBob" for any culture chosen. Culture can't (shouldn't) define whether it is {CR}{LF} or {CRLF}

          1. Anonymous Coward Silver badge
            Boffin

            Re: "... that doesn't create new productivity improvements for our users"

            I can't imagine any (modern) culture which wouldn't want to achieve the same thing with {CR}{LF} as with just {LF}.

            The days of separate actuators on the typewriter are long gone. While technically the CRLF combo is correct (we want to both home the carriage and feed the paper through by one line), those functions have been combined into the "Enter" key so 1 byte should be enough to infer meaning.

            Unless you're saying that the {CR} portion should home the carriage at the other end for RTL cultures? In which case they're clearly separate operations and in that case should be treated as such.

            1. PeteA
              Windows

              Re: "... that doesn't create new productivity improvements for our users"

              CRLF vs LF vs LFCR - let the religious wars being.

              Heretic here - I think C/LF is spot-on-right semantically, Un*x / MacOS is wrong. Horrible admission, but there we go. Maybe because once upon a time I use a real mechanical type-writer where CR and LF were different physical actions, as replicated in VT{52,1xx} and thereby into history..

              Icon because, well, DOS^h^h^h Windows got it right IMO

            2. PeteA

              Re: "... that doesn't create new productivity improvements for our users"

              CRLF is correct ... for a TTy.

              Sadly, TTy/VT is lowest common denominator ergo CRLF is corrrect regardless of 1-bye (semantic) excess. Seems a bit ironic to refer to {CR}/{LF} as "clearly seperate operations" that should be treated as such for RTL, which is reason they're separate operations^h^h^h TTy codes in Windows.

              Speaking as an old fart who *used* mechanical typewriters, MS got this one right. CR and LF are different. VTxxx mimics the mechanical operation, ipso facto *either* CR/LF or LF/CR are just fine but merely LF is not. Oh, and please remember line-oriented hard-copy output, e.g. DM printer* .... I know, it's archaic, but it still *works* and some of us don't like wasting money on shiny)

              * Citizen 120D - amazingly robust kit

      2. Nigel 13

        Re: "... that doesn't create new productivity improvements for our users"

        I think I see a joke there. I'm clapping anyway.

      3. Munchausen's proxy
        Joke

        Re: "... that doesn't create new productivity improvements for our users"

        I've got it! Let's use regexes!

        1. PeteA
          Joke

          Re: "... that doesn't create new productivity improvements for our users"

          s/regexes/regices/g

  4. Anonymous Coward
    Anonymous Coward

    Strings were much easier to work with when everybody used ASCII.

    1. sqlrob

      7-bit or 8-bit?

      You've got rose colored glasses on if you don't remember that fun with mail and file transfers of text.

    2. Ken Hagan Gold badge

      There was never a time when everyone used ASCII. That was one of the reasons we have Unicode.

      1. sbt
        Angel

        one of the reasons we have Unicode

        Truer words, etc. Not for nothing does the second I in ASCII stand for Interchange. Code Pages were display oriented. Unicode has at tried to deal with lexical ordering and code point equality (not always successfully, mind you; there's a fair bit of early unfortunate decision making that still bites). Composed vs. uncomposed, for example. Multiple semantically equivalent CPs.

    3. Loyal Commenter Silver badge

      Except the people using EBCDIC.

      And those who were also using EBCDIC but it wasn't the same EBCDIC.

      And the other four people using the other four versions of EBCDIC...

      1. A.P. Veening Silver badge

        You beat me to mentioning EBCDIC, but to be complete you should also mention DBCS (all variants).

  5. Unep Eurobats
    Paris Hilton

    Inconvenience or disaster?

    I suspect and hope that this functionality is used more for fuzzy logic than anything business critical. For example, if a feedback form contains the word 'complain' it gets routed automatically to /dev/null the complaints department, whereas if it contains the word 'fabulous' it goes to marketing.

  6. Anonymous Coward
    Anonymous Coward

    Future issue of PowerShell as well then

    It was a brief happy moment there as I laughed at the pain Devs face with this... then I remembered PowerShell runs on .Net so will presumably also have the potential to break my automation/provisioning scripts... bugger.

    1. PeteA

      Re: Future issue of PowerShell as well then

      Sodding good reason to stick with a proper shell then

  7. Steve 129
    Facepalm

    An API should NEVER change its behavior

    I see this all the time, especially from MSFT!!!

    An API call should NEVER change its behavior. If you want to change how it behaves in order to enhance features etc... then give it a different name !!!

    And don't get me started on Python 2 vs Python 3 !!!!

    1. sbt
      Meh

      Re: An API should NEVER change its behavior

      That'd be great if there were never any bugs. If the API has no versioning support, then you can't add a new variant of the function that's 'fixed', and you can't fix the old one.

      APIs are hard.

      1. Loyal Commenter Silver badge

        Re: An API should NEVER change its behavior

        Indeed. If you're writing code, and need to do a string operation, do you want to have to go and look up which version of the built-in function you should be using?

        It's a bit naughty of MS to change the behaviour of an existing function, yes, but languages evolve, and with C#, changes aren't made on a whim, but usually as a result of feedback from the community and a fair amount of debate.

    2. A.P. Veening Silver badge

      Re: An API should NEVER change its behavior

      IBM managed to have some APIs change behaviour on the AS/400 without breaking existing code. They did so by adding some optional parameters. If those parameters weren't used or contained default values, the behaviour remained exactly like it was before, thus not breaking existing code. If those parameters were used with non-default values (so new code), the behaviour differed in completely predictable and consistent ways.

  8. Anonymous Coward
    Anonymous Coward

    Hey, Microsoft is now cool, don't pick on them. They developing in a modern way like all the cool kids.

    AKA, will change something every other release and don't give a shit if it works or breaks anything of yours and will not tell you about it.

    1. PeteA

      Semver FTW?

  9. Anonymous Coward
    Anonymous Coward

    Internationalization was always a huge PITA

    This isn't the first time (and won't be the last time) that string comparisons cause bugs. You have to read the documentation really carefully... otherwise you'll just be getting confused about which whitespace character matches which other one, etc.

    ...or just restrict your strings to ASCII and the language to English. That works too.

    1. Warm Braw

      Re: Internationalization was always a huge PITA

      ... and your currency to the US dollar.

    2. Robert Carnegie Silver badge

      Re: Internationalization was always a huge PITA

      What colour of English are you practising defence with? ;-)

    3. Loyal Commenter Silver badge

      Re: Internationalization was always a huge PITA

      Is that English (Traditional) or English (Simplified)?

      (if your spelling of internationalisation didn't give it away already)

  10. razorfishsl

    MS doing exactly what MS does.......

    Destroy existing systems & user productivity...... because.....

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like