Chemists bitten by Python scripts: How different OSes produced different results during test number-crunching

Chemistry boffins at the University of Hawaii have found, rather disturbingly, that a particular set of Python scripts used for their research can produce different results on different computer operating systems, even when running the same code. In a research paper published last week in the academic journal Organic Letters, …

  1. Anonymous Coward
    Anonymous Coward

    Fixing the symptom…

    If you make repeated independent experiments, and then do some sort of numerical analysis on the results, and the outcome depends on the order in which you feed in the results, that would suggest to me your analysis is insufficiently numerically stable and/or your methodology is unsound.

    "These 15 people would still be alive if scientists could science properly, #9 will amaze you!"

    1. Flocke Kroes Silver badge

      Re: Fixing the symptom…

      If the file names are ISO 8601 dates then there could be a good reason to process them in a particular order.

      1. bazza Silver badge

        Re: Fixing the symptom…

        If they’re relying on file names containing important data like a time stamp they’re running at risk. Far better to have important data in the file itself... In fact, having a bunch of separate files representing a single dataset is asking for trouble too.

        The whole thing is adding up to results being correct and repeatable if, and only if, all the files are in the current working directory with the right file names and there being no other files with .out as an extension. That’s not great. Let’s hope they’re not compiling any C code in the same place.

        1. Joe W Silver badge

          Re: Fixing the symptom…

          Great that you do have time to do all this. Those people would rather have code that works now, and get back to the real science. Plus most are self-taught, being microbiologists or whatever.

          The file name often contains vital information because a real import script is hard to code (well, a lot harder than just reading in a table). No, it is not perfect, but many programs rely on this, and for software that is mostly written for oneself it is sufficient.

          1. Korev Silver badge
            Boffin

            Re: Fixing the symptom…

            Maestro is a programme used by computational chemists, structural biologists etc.; they can all be safely assumed to be expert computer users.

            1. }{amis}{
              Boffin

              Re: Fixing the symptom…

              You are assuming that "computational chemist" is an IT-focused degree; the computation they focus on in this kind of degree is the maths that happens behind the scenes.

              I've had to clean code for these kinds of people a few times and their lack of experience and training in good code design is very evident.

              I refer you to the inevitable xkcd:

              xkcd 1513, "Code Quality"

              1. Tom 38

                Re: Fixing the symptom…

                The whole function is an example of their lack of programming skills! It's not a bad thing, I'd rather they were good at the science than the programming, but every time a scientist friend has said "hey, can you help check my code, I can't see what's going wrong" I prepare myself for an hour or so of "wtf... wtf.. why are you doing this?"

                import glob

                def read_gaussian_outputfiles():
                    return sorted(glob.glob("*.out"))

                [edits - there are no ways of putting code in a comment that ElReg doesn't make look shit

                1. Stoneshop

                  Re: Fixing the symptom…

                  [edits - there are no ways of putting code in a comment that ElReg doesn't make look shit

                  A [code][/code]* block helps a bit, although the return causing an extra line break even inside that block is annoying

                  * with < and >, of course.

                2. John Brown (no body) Silver badge

                  Re: Fixing the symptom…

                  [edits - there are no ways of putting code in a comment that ElReg doesn't make look shit

                  10 PRINT "OH YES THERE IS!"

                  20 GOTO 10

                3. Sam Liddicott

                  Re: Fixing the symptom…

                  And under what locale or collation rules? Does "a" sort before "B"?

                4. Charlie Clark Silver badge

                  Re: Fixing the symptom…

                  Except single-line functions are the spawn of the devil!

            2. Pascal Monett Silver badge

              Re: Fixing the symptom…

              They may be expert computer users, that doesn't mean that they are good programmers. Their job is science, not programming, and they generally don't have the budget to employ actual professional programmers.

              1. Jakester

                Re: Fixing the symptom…

                Hiring a professional programmer is no guarantee that a program will be designed/programmed properly or that it will work as intended.

          2. Steve Crook

            Re: Fixing the symptom…

            > Great that you do have time to do all this. Those people would rather have code that works now, and get back to the real science.

            Great that you do have time to do all this. Those people would rather have code that **looks like it** works now, and get back to the real science.

            There, fixed it for you.

            If the only way you've got of sharing your results is through your code & data then the code is *part* of the science. By sending out buggy code you're distracting all the other scientists who each find discrepancies, can't use your data, then have to refer back to you to find out WTF is going on. So you're probably preventing even more science from being done by wasting their time...

            </rant>

            1. ibmalone

              Re: Fixing the symptom…

              Well of course, but if you have a solution to this then please get in touch with the MRC!

              In practice the groups that build this type of software, at least in my field (and particularly the US ones who have more money), do tend to be the more development-oriented ones, things like regression and unit testing do happen. But smaller groups, or groups for which this isn't their primary focus, just don't have the time or skills-base, and the incentives (staying employed) are for publishing your results, not for spending time trying to show they're wrong. Against this background we have the growing demands for open science and publishing the software, which is a good thing, but you can't then criticise those doing it for their code quality.

              Edit: what we're seeing here is actually the scientific method and open source in action: another group tried to replicate results, found they couldn't, identified the issue and disseminated it.

        2. Korev Silver badge
          Boffin

          Re: Fixing the symptom…

          Here each chunk of data was stored in two files that the script needed to match and didn't because of the way that each OS returned the list of files to Python. If the instrument put the lot into HDF5 then it'd probably be better.

          The output from Illumina sequencers is actually stored in the same way: each sample's data is in a pair of "fastq" files.

          1. Bronek Kozicki

            Re: Fixing the symptom…

            two files that the script needed to match and didn't

            That means that the script is faulty for not reporting an exceptional condition when it should, e.g. erroring out because of the file mismatch. Perhaps someone did not think that this might happen, but hopefully by now they do.

            1. Simon Harris

              Re: Fixing the symptom…

              It appears they do now - from the linked paper...

              To overcome this issue, we amended the script to include a line in the read_gaussian_outputfiles subroutine that forces sorting before pairing and a longer file-matching check function that alerts the user when there is a potential file-matching issue (see the Supporting Information).(17) We have tested the revised script across platforms (i.e., MacOS, LINUX, and Windows) running different versions of Python, and it yields consistent results.
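
              For illustration, a minimal sketch of that kind of defensive pairing might look something like the following (the ".log" companion extension and the pair_files helper are my assumptions, not the authors' actual code):

              import glob
              import os

              def read_gaussian_outputfiles():
                  # glob.glob() promises nothing about order, so sort explicitly
                  return sorted(glob.glob("*.out"))

              def pair_files(out_files):
                  pairs = []
                  for out_file in out_files:
                      mate = os.path.splitext(out_file)[0] + ".log"  # assumed companion file
                      if not os.path.exists(mate):
                          raise FileNotFoundError("no matching file for " + out_file)
                      pairs.append((out_file, mate))
                  return pairs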

        3. Simon Harris

          Re: Fixing the symptom…

          Far better to have important data in the file itself... In fact, having a bunch of separate files representing a single dataset is asking for trouble too.

          In medicine we have to deal with this all the time. When a CT, MR, or other volumetric scan is recorded using DICOM, which is the most widely used standard, a multi-slice volume is physically transferred via a storage medium with each slice in a separate file (the files are just 'transport mechanisms' - they could alternatively be network packets). A CT scan of, say, 512x512x400 voxels will be 400 files (plus report files and scout images) of typically a little over 512kB each, rather than one big 200MB file (compression is also possible to shrink this). All the information about a slice's position, orientation, which dataset it belongs to, etc. is stored inside the file. The often purely numerical filenames cannot be relied on to define any particular read order.

      2. iron Silver badge

        Re: Fixing the symptom…

        In which case the code should ensure they are processed in the right order and not assume they are because they were once on the dev's PC during a full moon.

    2. bazza Silver badge

      Re: Fixing the symptom…

      Yes, it’s a pretty big warning sign.

      I bet if they’d run it on a PowerPC based architecture they’d get different results again; different FPU, probably different arithmetic shortcuts. I know that AltiVec would make deliberate but minor errors for the sake of getting operations completed in a single cycle.

      I wonder if they’ve had that code peer reviewed by a computer scientist?

      1. James Ashton

        Re: Fixing the symptom…

        I bet if they’d run it on a PowerPC based architecture they’d get different results again; different FPU, probably different arithmetic shortcuts

        This is an OS issue, not a processor issue: it's about the way filenames are ordered by default, and nothing to do with floating point maths.

        1. Kristian Walsh Silver badge

          Re: Fixing the symptom…

          Yes. FP maths on all modern CPUs has to conform to the IEEE 754 standards. This is why the Pentium FPU bug in the mid-1990s was such a big embarrassment for Intel.

          The root problem is the filesystems involved, and how they return data. None of the candidate filesystems (NTFS, HFS+, APFS, Ext4, ZFS) gives any guarantee about the order of the entries in a directory data structure. But because most of the time that returned order is basically "order of creation", I'm going to put the blame for causing the problem not on the filesystems, but on the way the files were copied to disk.

          Any system using threaded copy will cause the files to be created in effectively random order in the directory. From your graphical shell, do two copies of the same file set into two different directories, and then retrieve the file list from each (ls -1f will show you the directory entries in the same order as they were encountered on disk) - they'll most likely be in a different order in both directories if the shell has done any kind of disk-copy optimisation. In the old days, when graphical shells ran copies on one thread only, the files would get written in the order that the shell found them in, usually whatever sorting you'd chosen on the view window, but that's not true anymore, and the order is non-deterministic.

          (I first encountered this problem when copying music-files onto an SD card that was consumed by a simple player device that treated "directory order" as "playback order"... )

          All that said, the Python API should be more user-friendly, given how widely Python is used by inexperienced and 'non-programmer' programmers. As performance is explicitly not a goal of the Python language, why not sort the names in codepoint order before returning them?
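
          A quick way to see this for yourself (the directory names here are hypothetical): copy the same file set into two directories with your graphical shell, then compare the raw listing order:

          import os

          a = os.listdir("copy_a")  # raw order, whatever the filesystem hands back
          b = os.listdir("copy_b")
          print(a == b)                  # often False after a multithreaded copy
          print(sorted(a) == sorted(b))  # True if the two copies really contain the same names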

          1. Luke McCarthy

            Re: Not sure the comparison is valid

            It would make sense for glob.glob() and os.listdir() to return sorted lists, to avoid this kind of platform-specific random input. They already filter out '.' and '..' anyway; sorting is hardly going to make any performance difference in practice.

            1. Spamfast
              Meh

              Re: Not sure the comparison is valid

              It would make sense for glob.glob() and os.listdir() to return sorted lists

              It would make more sense if people read the documentation. See here where it's quite explicit that "results are returned in arbitrary order" and here again, "The list is in arbitrary order."

              If your code requires your list of files to be sorted, then sort them - all it takes is sorted(glob.glob(x)) instead of glob.glob(x). If my piece of code does not require it, why should it have to suffer the overhead of sorting? Remember, Python glob can generate arbitrarily large lists of directory entries.

            2. Anonymous Coward
              Anonymous Coward

              Re: Not sure the comparison is valid

              I would not be keen on having the calls do anything beyond their stated function. You want results sorted in a certain way? Then sort them yourself. As someone else pointed out, naming files in a weird way and relying on that naming to properly order your data is a dumb idea. (I'd either timestamp it or load it into a database with a sequence number to keep things in the correct order. )

              Next we'll find that one's LOCALE_* settings adversely affected some research project.

              That the strange results came from two different releases of MacOS is bothersome. I'd bet dollars to donuts that there was nothing in any end-user-accessible release notes that noted that some functions had changed at the OS level, which might have raised red flags to the more astute researchers. Same goes for the Python development team. How extensive is their testing? Couldn't their testing have caught the change?

          2. Martin Gregorie

            Re: Fixing the symptom…

            If you run Linux and use gftp for transferring fairly large files, i.e. a few megabytes each, between ext2/3/4 file systems and watch the transfer happening, you can see quite clearly that the files are not transferred in sorted name sequence. Given that directories in these filing systems explicitly do not hold files in sorted order, why would you ever expect any bulk file access function to return files sorted by name unless its documentation says that's what it does?

            IOW if your program is written in a portable language and there's any chance that it will be run under a different OS on different hardware, i.e. you've published its source, and it depends on the order in which a list of files is processed, you MUST sort the list before using it and TEST that it does what you're expecting. OK, you may define the file ordering implicitly in runtime configuration parameters, but you MUST document this to say that the parameter order matters.

            Equally obviously, if a language's standard library includes a function that can return a list of files, its documentation MUST say whether the list is sorted or not and, if sorted, how the sorting is done: the natural sort order used by an IBM mainframe, iSeries or other system using EBCDIC character codes gives a very different result from one using the natural sort order on a system using ASCII or UTF8 encoding.

            Python documentation at least says this type of list is unsorted, so you can't say you weren't warned.

          3. Peter Gathercole Silver badge

            Re: Fixing the symptom…

            The order of files in a directory is not always the sequence of creation. Several filesystem types keep the filenames in something other than a date-ordered list. In that case, it will be the order in which the directory structure is traversed.

            JFS2 on AIX keeps the entries in the directory structure sorted according to a particular collating sequence that is set by the base language that the system has (and re-sorts it every time a file is created). GPFS uses something like a B-tree structure in a meta-database for the directory entries. I have seen some other things, but can't remember any more off the top of my head.

            And please note that some collating sequences are not [A-Z][a-z] sequence like ASCII. I remember when ISO 8859 became common, and the collating sequence for this is something like AaBbCc... etc., so when doing sorts and particular transliterations (using tr) you get some very unexpected effects.
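
            A small Python illustration of the same point (this assumes the en_US.UTF-8 locale is installed; the exact ordering depends on the system's collation tables):

            import locale

            names = ["banana.out", "Apple.out", "apple.out"]
            print(sorted(names))  # codepoint order: ['Apple.out', 'apple.out', 'banana.out']

            # Collation order depends on the locale's tables
            locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
            print(sorted(names, key=locale.strxfrm))  # typically groups apple/Apple together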

          4. Ken Hagan Gold badge

            Re: Fixing the symptom…

            "None of the potential filesystems involved (NTFS, HFS+, APFS, Ext4, ZFS) gives any guarantee about the order of the entries in a directory data structure. But, because most times that returned order is basically "order of creation" ..."

            Microsoft's documentation for their basic directory enumeration function (FindNextFile) states that NTFS and CDFS return results in alphabetical order. In the case of NTFS, going back to NT3.1 this was a contractual guarantee (for unicode-centric values of "alphabetical"). I don't know its status now, but there is at least a generation of Windows programmers who will never have seen a directory enumeration in non-alphabetical order. (Exception: Windows Explorer deliberately re-orders files so that File10.txt sorts after File9.txt, but even in this case it is quite widely understood that Explorer had to shuffle the order to get this effect. There are probably questions on StackExchange asking for the code that does this.)

            If you have never seen the OS returning the files in non-alphabetical order, it might never occur to you that it was possible on "other" systems. It is still a bug, but for someone who doesn't do portable coding for a living it is quite understandable.

            In fact, I wonder if the quoted comment "because most times that returned order is basically "order of creation" ..." implies that the poster works on a system that behaves that way and is assuming that all other systems behave the same way.
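
            For what it's worth, that Explorer-style "natural" ordering is easy enough to reproduce as a sort key (a generic sketch, not anything from the chemists' script):

            import re

            def natural_key(name):
                # Split into digit and non-digit runs so "File10" sorts after "File9"
                return [int(part) if part.isdigit() else part.lower()
                        for part in re.split(r"(\d+)", name)]

            files = ["File10.txt", "File9.txt", "File1.txt"]
            print(sorted(files))                   # ['File1.txt', 'File10.txt', 'File9.txt']
            print(sorted(files, key=natural_key))  # ['File1.txt', 'File9.txt', 'File10.txt']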

            1. Kristian Walsh Silver badge

              Re: Fixing the symptom…

              I did say "stored", not "presented". It's a subtle difference, but I'll accept that NTFS doesn't offer any other way of getting catalog data out except codepoint-sorted, so the point is moot for software running on NTFS volumes.

              In fact, I wonder if the quoted comment "because most times that returned order is basically "order of creation" ..." implies that the poster works on a system that behaves that way and is assuming that all other systems behave the same way.

              No. I was assuming nothing. I work on Linux, MacOS, Windows 10 and Ubuntu-on-WSL systems. I used that as an example of why assuming codepoint order is incorrect. NTFS does present its directory contents sorted this way, but HFS+ and most of Linux's filesystems do not.

              For what it's worth, I think NTFS does the right thing here - a consistent, predictable and (above all) logical answer from the question "what is in this directory?".

              The big surprise for people is that this is a _filesystem_ property, not an OS one. Running Linux, issuing `ls -1f` lists files in code-point order on an NTFS volume, but not on an ext4 one. (As can be demonstrated by using Windows 10's WSL feature.)

              Python apologists saying "but read the documents - you were promised nothing" are missing the point. Python is supposed to be a straightforward and intuitive language. Leaving these kinds of gotchas in there undermines this promise, and frankly, the effort that was needed to document the unintuitive and nasty behaviour is about equal to the effort that would have been needed to make the call work the same way across all filesystems in the first place.

          5. bazza Silver badge

            Re: Fixing the symptom…

            Yes. FP maths on all modern CPUs has to conform to the IEEE 754 standards. This is why the Pentium FPU bug in the mid-1990s was such a big embarrassment for Intel.

            Well, IEEE 754 is a storage format for real numbers plus some rules for operations. There's no guarantee that one's CPU will actually use it (even though it is quite common), and no guarantee that the FPU or vector unit will actually produce answers as accurate as those theoretically possible for values stored as IEEE 754, or that the FPU and vector units will produce identical results. There have been plenty of CPUs that store numbers as IEEE 754 but explicitly (that is, if one reads the data sheet in fine detail) guarantee not to achieve exhaustive arithmetical accuracy for operations performed on them. AltiVec and the Cell processor are two examples. I'm pretty sure that Intel have done the same, on occasion.

            The reason why this might be a factor is that arithmetical errors introduced early on in a string of calculations can have a bigger impact on the result than if they're introduced later. With CPU designs that take short cuts with FP arithmetic the error magnitude is data dependent; the same operation with different data produces a different magnitude of arithmetic error. So this application (which I am of course boldly and possibly erroneously assuming is heavy on floating point maths) may reasonably produce a different answer with files processed in different orders, even if the mathematics theoretically doesn't care about order, when run on a processor that doesn't guarantee exhaustive accuracy, or even consistent inaccuracy.
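
            A trivial Python illustration of that kind of order dependence (nothing to do with the chemists' actual calculations, just the general effect of non-associative floating point addition):

            small = [1e-16] * 10**6
            print(sum(small) + 1.0)    # small terms accumulate first: roughly 1.0000000001
            print(sum([1.0] + small))  # each tiny term is rounded away against 1.0: exactly 1.0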

            Calculating the theoretical numerical error expected in the results produced by one's application by reading the CPU datasheets is hard work, and has to be reassessed every time you change the code or run it on a different model of CPU. Without this one cannot say whether there's a statistically valid difference between answers such as 173.2 and 172.4, or whether either is close to being correct. This is much more likely to be a problem if the calculations involve a lot of data and a lot of cumulative calculations, which a lot of science applications do.

            I've previously seen a couple of science applications written that made errors with the use of FPUs on modern computers. Overflow is easily run into, which generally occurs silently (so there's nothing to tell you that something has gone wrong). To be confident that an application is performing reasonably correctly one has to be quite defensive, e.g. checking that input, intermediate and output vectors don't contain + or - INF, testing of extremities, etc. Checking that one's code has adequately implemented the required maths on a specific data set (i.e. not just test datasets) can be a big job, and takes an especially nerdy computer boffin who is prepared to really look at what a CPU actually does.

            You can even run into problems with library changes. Libraries that once used to run on the FPU might have been optimised in later versions to run on the vector unit (SSE, AVX, or whatever). On vector units the arithmetic accuracy achieved can be worse than that on the FPU, in the interests of outright speed.

            Of course, a lot of applications don't care (games, etc), and that's just fine; modern CPUs are all about being good enough for the bulk of applications. However, it doesn't suit all applications. Intel and chums do not pander to the 0.01% of users; they design to impress the 99.99%. The result is that almost all developers out there, whether professional software devs or scientists competent in Python, never have to give a moment's thought to whether or not floating point arithmetic in their applications is actually correct, because the vast majority of the time it's Good Enough.

            This sort of problem shows up in various areas of computing. Financial calculations generally avoid IEEE 754 binary floating point altogether, IBM going so far as to have a decimal representation of real numbers in their more modern POWER architectures, and adding in a decimal maths co-processor for good measure. There's even a GNU library for such representations. When dealing with $/€billions, the precision of binary floating point is not adequate to get sums right down to the cent, and one thing that annoys a banker when converting dollars to euros by the billions is getting the cents wrong.
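
            The usual two-line demonstration of why money code avoids binary floats, using Python's built-in decimal module:

            from decimal import Decimal

            print(0.1 + 0.2)                          # 0.30000000000000004
            print(Decimal("0.10") + Decimal("0.20"))  # 0.30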

            1. Whitter

              Re: Fixing the symptom…

              If your algorithm cares about the specifics of FP representation beyond 'double', you probably want to refactor that algorithm. No doubt sometimes there's nothing you can do. Personally, I'm 25 years in and I've only had to use the decimal type once.

              1. Claptrap314 Silver badge

                Re: Fixing the symptom…

                I put a lot of blood, sweat, and tears into proving that you get those EP bits. Please go ahead and use them.

                True, if you aren't using stable algorithms, it doesn't matter how many bits of precision you have. But those EP bits are real.

                But yeah, don't use the decimal type.

            2. Claptrap314 Silver badge

              Re: Fixing the symptom…

              Most modern microprocessors explicitly DO guarantee that they comply with IEEE-754, and every one that I've seen that does not includes in the manual library code to generate IEEE-754-compatible results, allowing them to claim that they do. (IEEE-754 specifies that it may be implemented by a combination of hardware & software.)

              Furthermore, outside a double-handful of corner cases that Intel & IBM would not agree on, IEEE-754 DOES specify precise results for all operations.

              Yes, bugs do occur, but (at least for the decade I know) far more occur in caching (including branch prediction) than in the FPU.

              --

              EVERY operation in the IEEE-754 standard is subject to rounding and overflow. But the standard also specifies that exceptions can be enabled for these cases. Yep, if you don't read the docs, things can go bad.

              And yep, rounding is a thing. And properly designed algorithms will handle it properly. The theory for doing that is actually pretty interesting.

              --

              I was there when IBM started working on the decimal format. I REALLY don't trust it. The guy leading the effort had no interest in formal methods. Expect nasty, nasty failures. And IBM will hush them up because of the nature of the POWER market.

              Just use integers for money.

              --

              Yeah, I'm that exceptionally nerdy computer boffin. Shame that IBM management tended to shoot the messenger.

          6. Vincent Ballard
            Coat

            Re: Fixing the symptom…

            It's not as simple as "FP maths on all modern CPUs has to conform to the IEEE.754 standards". For example, Java has a keyword strictfp to force evaluation in accordance with IEEE 754 standards for code which needs consistency. If you don't use that keyword then, on some (nowadays probably most) processors it will use optimisations like fused-multiply-add which aren't defined in IEEE 754. This can result in inconsistencies between platforms.

            1. Claptrap314 Silver badge

              Re: Fixing the symptom…

              You are technically correct....

              Yes, IEEE-754 does not specify MAC operations (which is criminal). However, using MAC operations generally is a huge boost not only to speed but stability. READ THE DOCS. If the manual states that the processor correctly rounds the infinitely precise results back to the representation format according to rounding control, then USE IT. MAC is used internally on two-operand multiplies, which means that the formally checked proofs will necessarily have validated the correctness of the underlying hardware along with the usual two operand ops. Validating the MAC opcode will then be near-trivial.

          7. Claptrap314 Silver badge
            Boffin

            IEEE-754 is NOT "standard"

            The IEEE 754 committee included Intel & IBM who were each unwilling to change their implementation. There are numerous corner cases where they differ, and you can spot these in IEEE-754 when it uses the word "may" when specifying results.

            For the record, I was working for AMD, so we had to match Intel. Guess on which side my intuitions lied? :D

          8. david 12 Silver badge

            Re: Fixing the symptom…

            Yes. FP maths on all modern CPUs has to conform to the IEEE 754 standards.

            Yes, and there are unavoidable 'implementation dependent' details in the IEEE 754 standard.

        2. Sam Liddicott

          Re: Fixing the symptom…

          Really?

          How many OSes do you know of that sort the results of readdir?

          That would require the OS either to read all of the filenames before returning any of them, or the FS to maintain an order, and most filesystems do not.

    3. tfewster

      Re: Fixing the symptom…

      I suspect the files are being listed in inode order, and you could get different results on the same system with multiple runs, or on different filesystem types.

      1. Anonymous Coward
        Mushroom

        Re: Fixing the symptom…

        > I suspect the files are being listed in inode order

        More likely the btree walk order - which would vary from machine to machine depending on the order the files were created/copied. Which would depend on the order they were read from the original source, which would depend on the btree walk order for that device, which would etc etc

        [exploding head icon ->]

      2. This post has been deleted by its author

      3. Peter Gathercole Silver badge

        Re: Fixing the symptom…

        On a UNIX-like system, the i-node order matters not one jot. Each directory entry contains a file-to-inode mapping, the directory file is what is accessed when you look at a directory's contents, and the i-nodes appear in any order.

        On almost all UNIX-like systems, when you look at a file in a directory, the named entry is looked up, and then if more than just the name is required, the i-node number is fetched from the directory entry, and the i-node read to get the rest of the required information (which is pretty much everything about the file other than its name).

        If your OS has it, you may be able to use the ncheck (ancient UNIX command predating fsck, along with icheck and dcheck) utility to read the inodes in sequence.

    4. gnasher729 Silver badge

      Re: Fixing the symptom…

      Nothing to do with numerical stability or hard stuff like that at all. They used a function that is defined to return filenames in arbitrary order. Unfortunately, when they tested the code, filenames were returned in alphabetical order by coincidence. And using the code for two directories with related files returned the same files in both directories in the same position - by pure coincidence. They tried to map input and related output files. On a system without the sorting coincidence, the wrong input and output files were combined.

    5. Mark 85

      Re: Fixing the symptom…

      "These 15 people would still be alive if scientists could science properly, #9 will amaze you!"

      Just curious what this is about. Nothing in the article. Looks like clickbait from here but no link. WTF?

  2. Jemma

    Stating the bleeding obvious part #261

    I'd never thought of this issue before but it makes perfect sense. Every OS does things in its own certain way, and where small differences in complex calculations are important it will have a greater or lesser effect. I wonder what result you'd get from Windows ME?

    This is worrying for a whole lot of reasons - efficacy tests of drugs like antidepressants, where the statistics can be very close between activity and "basically a placebo", are just one example. These drugs can be unpleasant and in some cases include raising the risk of suicide, so an artefact of .3 or .5 in the wrong direction (ie towards efficacy) might be enough to allow the manufacturers to dump something on the market that doesn't work and puts people at risk (not that they aren't fudging the results already, see "The Emperor's New Drugs").

    I wonder also whether hardware might have an effect on the results (other than firmware bugs) - or whether older computers might have been much worse. For example, the Inbredistani nuclear H-bomb test that decided to produce 6x the output power and doused a Japanese fishing boat in radioactive coral (so technically Inbredistan has nuked the Japanese thrice, because radiation got onto the Japanese mainland in that incident, from landed fish). It's always been claimed to be an unexpected consumption of lithium fuel - but what if it was 'unexpected' because the computer simulations and calculations had a brainfart?

    1. Remy Redert

      Re: Stating the bleeding obvious part #261

      If they ran that on Windows ME they'd get the same result you always get on Windows ME.

      A blue screen of death

      1. Anonymous Coward
        Anonymous Coward

        Re: Stating the bleeding obvious part #261

        I think you are mistaking WinME for Win98.

        I used WinME for a decade on nursery children's computers and never had a BSOD, no matter what the little sods did (swapping language mid game, inverting the screen etc).

        WinME's biggest issue was the lack of device drivers - which can be laid at the doors of all the 3rd party hardware manufacturers.

        The exact same thing happened with WinXP64; stable OS, no drivers.

        Based on this story, there seems to be merit in programming for specific hardware and not having an OS floating on top.

        1. Pascal Monett Silver badge
          Trollface

          So you were that guy using WinME ?

          I always wondered who they were talking about.

          1. StripeyMiata

            I know someone who paid £60 for an upgrade disk/license to upgrade their Win98 PC to WinMe.

        2. heyrick Silver badge

          Re: Stating the bleeding obvious part #261

          Downvoted because I used 98SE for ages and it was absolutely rock solid.

          The other computer, running ME, couldn't be trusted to boot without problem. And that was a regular machine, nothing weird, and a plain install of ME. It was just awful. It's why I stuck with 98SE for so long...

          1. Peter Gathercole Silver badge

            Re: Stating the bleeding obvious part #261

            I upgraded a system from Windows 95 to Windows ME so that I could get some 3D drivers to work for some graphics card that did not have them for Win95, but did for WinME.

            I actually found it pretty stable, especially once whatever game was running took over the system.

          2. Pirate Dave Silver badge
            Pirate

            Re: Stating the bleeding obvious part #261

            98SE was good once you got all the non-Plug-n-Pray interrupts worked out, and assuming you weren't using a soft-modem with a flaky driver.

            And I thought the big problem with ME was that MS had ripped out all of the "enterprise" networking stuff like their Novell Client and NT domain stuff, in an attempt to make it more "consumer" focused.

          3. Jakester

            Re: Stating the bleeding obvious part #261

            I too hung onto Win98SE for a long time. The computer originally came with WinME, but I struggled for almost a year before I tossed in the towel and upgraded back to 98SE. On the other side of the coin, I have a brother who bought a WinME computer about the same time as I bought mine. He never had a bit of trouble with ME. However, after a few years, we did take advantage of the Win 7 upgrade option that came with his computer. The computer hummed along fine with Win 7 and we took advantage of the free upgrade path to Win 10 a couple years ago. The computer is still working fine, though a little sluggish as it is one of the old Core 2 Duo processors (I forgot which model).

    2. Korev Silver badge
      Boffin

      Re: Stating the bleeding obvious part #261

      This programme is normally used right at the start of discovering a drug; there would be years of lab testing before it even hits a human for the first time. There would then be multiple rounds of clinical trials, and at the late stages of drug development there would usually be multiple statisticians working independently on the data, ensuring that they can get equivalent results to each other.

    3. Anonymous Coward
      Anonymous Coward

      Re: Stating the bleeding obvious part #261

      Castle Bravo, IIRC, arose because the Americans were so anxious to get an H-bomb they didn't really know what they were doing. Better simulation wouldn't have helped.

      The information Philby extracted from the US enabled the USSR to build the King of Bombs - they were shocked because it produced far more energy than they expected, so much so that Khrushchev's family reported that he was horrified by it.

      I don't think a better computer would have helped. Accidentally irradiating Teller in 1944 might have done.

      1. HorseflySteve

        Re: Castle Bravo

        It wasn't so much not knowing what they were doing, but the fact that lithium-7 (which is more common than lithium-6) had not, up until then, been observed to participate in the generation of tritium. Castle Bravo was the experiment that showed them it most certainly did when under intense neutron bombardment, and even increased the neutron flux by emitting a fast neutron as it split. The end result was they got 3x as much tritium to fuse as they expected and consequently around 3x the BANG!

    4. Intractable Potsherd

      Re: Stating the bleeding obvious part #261

      @Jemma: I was thinking exactly the same. There is a historical tendency for data sets from commercial trials not to be released, so re-analysis isn't done. Your example of SSRI antidepressants is a good, and valid, example of where a small difference in the results can make a huge difference to the outcomes.

  3. Anonymous Coward
    Anonymous Coward

    Science

    Starting a Python course for science research purposes.

    Guess I'll be sure to use the same hardware and OS when actually running scientifically significant code.

    1. oldtaku Silver badge
      Headmaster

      Re: Science

      That's the wrong response. You have to make sure that if there's some STRONG assumption in your code, like the ordering of files, that you enforce that.

      If you only run the same hardware and OS every time then you might miss that it's completely wrong because you're making the same wrong assumptions every run.

      1. Kristian Walsh Silver badge

        Re: Science

        Agreed. This problem has nothing to do with hardware or OS. It's an error in thought. Assuming that the external input always conforms to an implicit pattern is dangerous, but only a lot of experience with filesystems and Python - much more experience of programming than I'd expect a full-time research biochemist to possess - would reveal this.

        Software developers know (or should know) about input sanitisation, but if you're looking at a window that's listing your files, and the files are nicely ordered by name, you have no reason to suspect that the little program you just wrote will load them in any other order than that.

        Simply doing a .sort() on the directory contents would fix the issue, of course, although I personally think that given Python's user-base and its principle of favouring "proper" over "fast", sorting the returned filenames by code-point should have been the built-in behaviour.

        1. Roland6 Silver badge

          Re: Science

          >Agreed. This problem has nothing to do with hardware or OS. It's an error in thought.

          Funnily, while reading all the (self-righteous) comments here, I realised how much this research reminded me firstly of my own experience of C/Unix file operations, where some things were (still are?) implementation-dependent, and secondly of Agile software development, where much is focused on getting the core functionality working and only later going back and making it more robust.

    2. Bill Gray

      Re: Science

      Please use multiple machines with a range of OSes and compilers.

      I write software for determining asteroid orbits from observations, and their probability of impact with the earth and where they will hit. I compile and run the code on Linux, FreeBSD, and Windows. I use GCC, clang, MinGW, and Microsoft C/C++, with full warnings on each. If I get different answers, something is wrong and needs to be investigated.

      You may have heard that "a man with one watch knows what time it is; a man with two watches never knows what time it is." More realistically, the man with one watch is certain what time it is, but probably wrong.

      If you tell me you did scientific research and got a result, and did as little checking of that result as possible for fear you'd get a different answer, my trust in your result will be nearly zero.

  4. oldtaku Silver badge
    Megaphone

    I think they're deciding which OSes 'failed' wrong.

    If your algorithm really depends on random files being loaded in some specific order you had better make dang sure you sort those file names before loading.

    I think this has less to do with what OS you're using and more to do with how you copied the files into the directory. If you unzipped a file you will always get the right results, because the files in a zip have a fixed order, but if you check out the files from your repository, or check out then copy them to a directory, the order may be semi-random.

    For instance they claim Windows 10 worked right here, but I know from experience that Python glob on Win10 can return a different order depending on the real (non-sorted) order of files in a directory. They just got lucky when they did it, based on how they got those files there.

  5. sbt
    Facepalm

    "the order in which files get compared affects the results"

    That seems like a pretty obvious bug. If it isn't, then the code failing to enforce the correct order by sorting after the glob is the bug.

    The use of Python seems immaterial to this. Scripting languages are not known for their numerical precision, mind you.

    1. Korev Silver badge

      Re: "the order in which files get compared affects the results"

      Most Python people doing serious numerical work would be using Numpy, Scipy etc. for the heavy lifting.

    2. Frumious Bandersnatch

      Re: "the order in which files get compared affects the results"

      The use of Python seems immaterial to this

      I think not, actually. As it happens, by chance (not related to reading the article), last night I was looking at a similar bit of code that I wrote in Perl a while back. I was worried that the output of Perl's sort routine would vary depending on the locale in use. After checking the documentation, it seems that it doesn't. It uses the standard C locale, and you have to explicitly add a pragma (well, "use" line) if you want lexicographical sort order. That is, if you want "Adam" and "adamant" to sort next to each other. Fortunately, it seems like my code didn't have this bug.

      So where I'm going is that it's actually a Python problem rather than an OS problem. Or just a "programmer failing to read the manual" problem, but I think that you can also blame the design of the language routine for not producing deterministic output by default...

      1. sbt
        Alert

        There's a problem with my python I need to get sorted

        You'd love C. So much nondeterminism and implementation dependence.

        For what it's worth, I'll agree with you if the code invoked a sort routine after globbing, and the sort results were somehow platform dependent (despite identical locale settings, if applied).

        Or if the contract for the Python glob function promises a particular order. Last time I wrote any directory access code in C++, I seem to recall getting "directory" order, the order of the file/inode entries in the directory. Pretty worthless.

        1. Simon Harris

          Re: There's a problem with my python I need to get sorted

          "Pretty worthless"

          When I'm using file manager, depending on what I'm doing, I might sort by date, file-type or alphanumerically, and I might choose ascending or descending order.

          Similarly, why should a language's file system API presume to know what sort order an application wants (if any) - most languages have sort functions in their libraries these days - why not let the developer choose the most appropriate for the application?

          1. Doctor Syntax Silver badge

            Re: There's a problem with my python I need to get sorted

            "Similarly, why should a language's file system API presume to know what sort order an application wants (if any)"

            Fair enough. But one of the functions of a high level language is to present the programmer with a platform that isn't dependent on low level stuff. That move started when symbolic assemblers replaced writing raw machine code - or possibly when people started writing machine code instead of using plug-boards and switches or whatever. It's not unreasonable to expect a language to offer an API option that at least provides a consistent order independent of platform.

          2. sbt
            Thumb Up

            Pretty worthless

            On its own, as a sort order. If I need all the files but don't care which order, then the order of storage is as good as any other and may perform slightly better depending on storage medium.

            If I do need some order, I agree that the file system API need not (nay, can not) assume. However, if there's an optional parameter that allows order to be specified that is implementation independent, that could streamline matters. Particularly since the returned structures from globbing functions can be more elaborate than an array of strings that can be dumped into a general purpose library sort function, and your examples show the need to sort beyond the filename.

      2. Baldrickk

        Re: "the order in which files get compared affects the results"

        Definitely not a python issue. It doesn't perform sorting when globbing. It's documented that it doesn't sort. It shouldn't sort because sorting is trivial in the case that it's required and unnecessary overhead in the case that it isn't.

        It's not an OS problem either. There isn't a specification for order of files in a directory.

        The issue here is in the assumed constraints not being enforced by the programmer.

      3. disgruntled yank

        Re: "the order in which files get compared affects the results"

        Well and good, but I see no sort routine in the code snippet.

      4. FIA Silver badge

        Re: "the order in which files get compared affects the results"

        So where I'm going is that it's actually a Python problem rather than an OS problem. Or just a "programmer failing to read the manual" problem,

        No, it really is just a programmer failing to understand/read the docs problem.

        but I think that you can also blame the design of the language routine for not producing deterministic output by default...

        That's just a misunderstanding though. The results may not be what you expect; however, they are deterministic. It's just that determining them requires far too much information about the underlying system in use, hence the ordering is documented as undefined. (If you knew the underlying filesystem in use, and the full info for each file, and how the filesystem stores the metadata, and how its readdir implementation walks those data structures, you'd be able to work out the order the files would be returned.)

  6. Oengus

    Flashback

    I remember an issue years ago where some of the Intel Maths co-processors had a bug in the floating point unit and so would, under certain circumstances, give incorrect results at the CPU level. This meant that you thought you were running the same hardware and software but because you had a buggy co-processor you got incorrect results. I even remember that eventually there was a simple test you could run in Excel to identify if you had the buggy co-processor. Fortunately none of my critical code ran on the Intel platform...

    1. Flocke Kroes Silver badge

      Re: Flashback

      Intel's first response was to do nothing because the errors were sufficiently small that they would not matter to the majority of customers. They did eventually offer some sort of replacement program, perhaps in response to:

      I am Pentium of Borg. Prepare to be approximated.

      1. Jay 2

        Re: Flashback

        I believe that was the Pentium 60 and 66 (I can recall the uproar at the time). The 75 onwards was OK.

  7. nekomoto

    This is a standard numerical analysis issue

    The core of this problem is that with floating point calculations, precision is often lost.

    Consider the case of summing a (metric) truckload of small values: they *may* eventually add up to something significant.

    But in the case of adding a tiny value to an existing large floating-point number, it can quite often be rounded away to insignificance.

    So the common solution is to perform your arithmetic on the "finer details" first.

    Sure, sorting the input filenames gives you the same result (at least on systems using the same IEEE-754 floating point hardware), but what they should be striving for is the most accurate result. If the ordering of the input files has a significant effect on this, then this isn't happening.
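
    In Python terms, a toy example of the same effect (nothing to do with the actual scripts in the article):

    import math

    values = [1.0] + [1e-16] * 10**6

    print(sum(values))          # 1.0 - every tiny term is rounded away against the big one
    print(sum(sorted(values)))  # ~1.0000000001 - the "finer details" are summed first
    print(math.fsum(values))    # correctly rounded result regardless of input order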

    1. Anonymous Coward
      Anonymous Coward

      Why the down votes?

      'nekomoto' has identified what is likely to have been the real cause of the issue.

      1. Kristian Walsh Silver badge

        Re: Why the down votes?

        I didn't downvote, but I suspect it's because the article explicitly says that the problem is due to file ordering, so it's a "RTFA" reaction. Secondly, research scientists know a lot about numerical precision, so that would have been ruled out early. Also, the poster assumed that the mathematics was floating-point, when almost all scientific modelling uses decimal-based mathematical libraries that don't exhibit this kind of rounding: the most commonly-used maths packages can work at thousands of digits of precision as long as you've got enough memory. Finally, even if it was FP rounding, the scale of error reported is far beyond what you'd expect from FP rounding, even on huge data sets.

        1. WilliamBurke

          Re: Why the down votes?

          Agree, except for the "maths packages". Scientific simulation code is generally written in Fortran or C/C++, with 8-byte Reals, compiler-optimised as far as possible without breaking the test set, and may still run for weeks on hundreds of cores to produce a result.

        2. Vincent Ballard
          FAIL

          Re: Why the down votes?

          "Finally, even if it was FP rounding, the scale of error reported is far beyond what you'd expect from FP rounding, even on huge data sets."

          You're assuming linearity. If the code has any kind of foo() if x >= 0 else bar() then FP rounding differences can cause arbitrarily large errors.

    2. Jemma

      Re: This is a standard numerical analysis issue

      To be honest if your developer has set up a system to do this sort of work and then made it so it's dependent on *file names* as any form of control - might I venture to suggest that a cattle prod, some quicklime, a spade and some soft ground is a good idea.

      I can think of at least one situation where an OS could happily banjax file names in less than a millisecond - that wonderful way long filenames in Windows of various versions could be royally buggered up if you happened to try editing them in a DOS box (one of several reasons, I suspect, why most Windows system files are still in the 8.3 format).

      It's still risible coding and design - just for a different reason.

  8. henklaak

    Scientists are script kiddies.

    I have the pleasure of being a SW dev amongst 50 physicists.

    What they come up with is sometimes diabolical.

    I've encountered a case where the numerical analysis was so unstable that it produced different results on different processors.

    Inverting a matrix with condition number 1e60 doesn't seem to bother them at all.

    H.

    1. Oh Matron!

      Re: Scientists are script kiddies.

      Upvoted for the use of the word "Diabolical" in a nonbondian situation.

    2. Frumious Bandersnatch

      Re: Scientists are script kiddies.

      Inverting a matrix with condition number 1e60 doesn't seem to bother them at all.

      That's why I do all my calculations on finite fields. </smug>

    3. ibmalone

      Re: Scientists are script kiddies.

      I have the pleasure of being a SW dev amongst 50 physicists.

      What they come up with is sometimes diabolical.

      As a physicist I can only say:

      Bwahahaha!

      In other "news" (quotes because 2012): FreeSurfer (neuroanatomical package) result differences by software and OS version. In that case I suspect the OS differences come from compiler optimisation and numerical libraries (obviously the result changes between different versions of the software are another matter, and are sometimes part of the point). I've seen similar things with compiler choices about instruction sets making a difference.

      Have also previously run across an instance where the exact issue mentioned in this article was helpful: the software in question produced quite different results on different machines, and it really shouldn't have. On closer inspection, of a lot of files that were supposed to be averaged, just the first one was actually being used, which was a different file on different machines. As this was only used to derive a ballpark starting point the resulting output difference was tiny, but as all other things were equal it seemed worth tracking down (and sort of was, as the conclusions from re-running were utterly unmoved).

      My personal bête noire is that nobody, even people from engineering or computer science backgrounds, bothers with online average/variance algorithms, and when you point out it's an issue the likely response is just to change to 64-bit (which, even more annoyingly, is usually enough).
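
      For anyone wondering what an online algorithm looks like here, Welford's single-pass mean/variance is the canonical example (a generic sketch, not code from any of the packages mentioned):

      def online_mean_variance(values):
          # Welford's algorithm: one pass, numerically stable, no sum-of-squares blow-up
          n = 0
          mean = 0.0
          m2 = 0.0
          for x in values:
              n += 1
              delta = x - mean
              mean += delta / n
              m2 += delta * (x - mean)
          variance = m2 / (n - 1) if n > 1 else 0.0
          return mean, variance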

      Actually, in Python my recent favourite might be scipy.signal.wiener, which took up more of my time than it really should have. You see, in signal processing, "Wiener filter" usually means an application of the Wiener deconvolution. In image processing there's sometimes another (I think less common) meaning, which is basically a local mean filter. Confusingly, skimage.restoration implements the Wiener deconvolution, while scipy.signal.wiener implements a local mean filter. (It should probably be a clue that the latter doesn't let you supply a PSF, but you won't actually find out unless you inspect the source.) This also means there's no scipy 1-D implementation of the deconvolution (a relatively common application), so you'll have to roll your own (not that hard, but plenty of scope for errors), and it provides a source of Stack Exchange questions where I'm not convinced anyone actually realises the difference.

      1. Claptrap314 Silver badge

        Re: Scientists are script kiddies.

        Some of the hyper-optimizations can actually randomize the resulting code. Therefore, it is possible to get different results after compiling the same code with the same libraries on the same system.

        But...if your code cares at that level, then your reporting of results cannot. If you've not dug into the implementation of floating point divide anywhere, please do. I expect it will give you some useful ideas for structuring your computations.

    4. Claptrap314 Silver badge

      Re: Scientists are script kiddies.

      What's really going on is that you are seeing what happens when a hyper-gifted imagination gets a little knowledge.

      The code that I first wrote after leaving mathematics was...truly special. One of the steps along the way to becoming a software engineer was to aim for boring code.

  9. Anonymous Coward
    Anonymous Coward

    Different fields

    At least it was a processing mistake rather than the much more depressing collection mistake. Researchers - please hire good computer scientists when you start your project.

    1. Anonymous Coward
      Anonymous Coward

      Re: Researchers - please hire good computer scientists when you start your project.

      But we have barely enough money to hire ourselves!

      1. ghp

        Re: Researchers - please hire good computer scientists when you start your project.

        Odd. Last week a couple paid 2.000.000€ for an injection that may save their daughter.

        1. Bronek Kozicki

          Re: Researchers - please hire good computer scientists when you start your project.

          I think you are being unfair to the team effort developing whatever medication you have in mind. This was most assuredly not a one-person endeavour.

          1. Jemma

            Re: Researchers - please hire good computer scientists when you start your project.

            I'd think it's the young Battens girl, although even if it stops progression it's hardly going to do her many favours, the state the poor kid is in already. I can understand the parents not wanting to lose her but sometimes I wonder whether it's right to fight for someone to keep existing with pain and a very limited quality of life. I always wince a little when I read these kinds of story.

            It looks likely that it may not be a lock anyway since her seizures have started to climb again.

            Think "Lorenzo's Oil" without the acting talent.

            1. Anonymous Coward
              Anonymous Coward

              Re: Researchers - please hire good computer scientists when you start your project.

              "...but sometimes I wonder whether it's right to fight for someone to keep existing with pain and a very limited quality of life."

              There is a mindset that life itself is innately superior to any other experience and that suffering does not take away from this and may actually improve it. Thus you have those who prefer to submit to evil and bide their time rather than fight to the death.

        2. ibmalone

          Re: Researchers - please hire good computer scientists when you start your project.

          There are many possible answers here about how the drug discovery and development industry works, but the most pertinent one is that a relatively small chemistry department at a US public university is unlikely to be seeing much, if any, of that money.

          1. ghp

            Re: Researchers - please hire good computer scientists when you start your project.

            The latter. The money goes to Novartis, not the researchers that came up with the possible cure.

  10. Anonymous Coward
    Windows

    Regression testing

    When Luo ran these "Willoughby–Hoye" scripts, he got different results on different operating systems...

    Frankly, anybody who designs, distributes, or uses software packages which do not include an automated regression testing component is simply asking for trouble. This is especially true when the package relies on any external functionality - including any dynamic libraries, scripts, external services, or, in this case, unspecified behaviour of system calls which was accidentally relied upon by the original developer.

    If the results produced by the software package are important, one should also re-run the regression tests every time the system is updated or undergoes repairs or changes of any kind. It is also necessary to avoid any non-essential external dependencies. When they can't be eliminated, such dependencies should be clearly documented, and if at all possible packaged together with the application, so that they won't be unintentionally updated, changing the results. Under no circumstances should such dependencies be updated dynamically while in production use.

    I realise the "modern" trend is to do exactly the opposite, and to pull in a gazillion external libraries and scripts whenever possible - I'd rather have my software functioning correctly, or at least predictably, than have it up-to-date.
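
    As a minimal sketch of the kind of regression test being described (pytest assumed; the function, data path and expected value are purely illustrative - the 173.2 merely echoes the figure quoted in the article):

    import pytest
    from mypackage import compute_shift   # hypothetical analysis function under test

    # Reference value produced once on a trusted setup and stored alongside the code.
    REFERENCE = 173.2

    def test_known_dataset_reproduces_reference():
        result = compute_shift("tests/data/known_dataset")
        assert result == pytest.approx(REFERENCE, abs=1e-6)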

    1. JoeCool Bronze badge
      Angel

      Re: Regression testing

      Yep. I was going to post "let me google python unit test framework for you. huh look at that".

      I appreciate, once again, scientific proof that programming is indeed a skill that is not possessed by everyone capable of running a script.

  11. STOP_FORTH
    IT Angle

    Language question

    Slightly OT, sorry. I realise there will soon be a gazillion computer "dialects" out there and am looking for advice on which one to use.

    I want to be able to open biggish files (say less than 20 GB, maybe much less), grab a largish chunk of hex data and parse it and also be able to cope with big (42 bit) integers. Cope with, in this context, means subtract or compare, or possibly multiply by a small integer. I would never need to multiply two 42 bit numbers.

    The hex data extracted from the file is often not in a fixed length or field format and may consist of binary flags, ASCII or Unicode data, big numbers, run length codes, bitmaps, all sorts of garbage really.

    Modern languages seem to be strongly biased towards handling text, and do weird things if I try to import a lump of hex data.

    There are "professionally written" applications for analysing these files. They are all deficient in some way.

    Please don't say C.

    (Long time ago I used to dabble in FORTRAN, BASIC and 6502 Assembler. Sometimes I used to be paid to do so, but I am not a professional programmer.)

    1. Duncan Macdonald

      Re: Language question - Two programs

      I would be tempted to use 2 programs - the first to take the input file and extract the wanted data (still as hex text) into a second clean file, and the second to do any required processing on the hex data. You said that you used FORTRAN in the past, so use that for the numeric processing (the Z edit descriptor in FORMAT handles hexadecimal) and INTEGER*8 is 64 bit so will handle 42 bit numbers easily.

      For the first program use whichever string processing language you are most familiar with (awk, Perl, SNOBOL etc (If you are a masochist then you could use TECO !!!)). Familiarity is probably more important than the precise language to reduce the learning curve.

      Keep the intermediate file - a human examination will often show the problem when there are weird results.

    2. Kristian Walsh Silver badge

      Re: Language question

      My general advice agrees with Duncan's, above: split the extraction and manipulation into different tools as much as possible: make your extractor export the data to a useful intermediary format (CSV, for instance, can be pulled into spreadsheets, but I know nothing about your format, so maybe that won't suit). You can then write other tools to work on that intermediary format, and use a third tool to re-assemble it into the packed form if needed.

      C#, Java or Python can all do what you're looking for (so can C, but it is much harder, and you said "no C"). Python can deal with unpacking binary data, although the mechanism is a little odd (look up the Python 'struct' module and its unpack function). I find C# (and the .Net/Mono libraries) to be better at getting down and dirty with binary data than Java, but that may be my own prejudices against Java in general.

      Once you've got the data out, Python will transparently deal with the largeness of the numbers (for integer values over 32 bits, it silently switches to an arbitrary-precision "bignum" type). In the other two languages, use library "big integer" functions rather than the built-in long or int types (in C#.Net or Mono that's System.Numerics.BigInteger, in Java, java.math.BigInteger).
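
      As a rough illustration of both points in Python (the record layout here is entirely made up):

      import struct

      # Hypothetical record: 2-byte little-endian length, 8-byte unsigned counter, 4 ASCII bytes.
      record = b"\x07\x00" + (2**42 + 5).to_bytes(8, "little") + b"DATA"
      length, counter, tag = struct.unpack("<HQ4s", record)

      print(length, counter, tag)   # 7 4398046511109 b'DATA'
      print(counter * 3 - 1)        # plain Python ints never overflow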

      Given the limited information on what you're doing, I'd lean towards Python in this case: extracting the info will be a little harder, but manipulating it afterwards may be easier.

      However, if this is something you're hoping to grow into a larger application, then don't go down the Python route - python programs can become very difficult to maintain when they get beyond a couple of thousand lines of code. Keep each program small, with a clear purpose, and resist the urge to attach unrelated features to an existing script.

      1. Baldrickk

        Re: Language question

        "python programs can become very difficult to maintain"

        More than any other language? Please elaborate.

      2. thames

        Re: Language question

        @Kristian Walsh - Just to address your point about how Python handles integers, what you describe is how Python version 2 did things. Starting with version 3 all integers are "long" integers, there are no more "short" (native) integers.

        What this means is that there are no longer unexpected results caused by integers getting promoted to long (this wasn't a problem for the actual integer value at run time, but more one of the type changing causing complications when inspecting types in unit tests). Eliminating the need for Python to check whether an integer was a "short" or "long" integer each time an object was accessed meant that there was no performance penalty to simply making them all long.

        For the sake of those who don't know what the above means, with Python it is impossible for integers to overflow because they are not restricted to the size of native integers but can be infinitely large. This is a big help as you don't have to add integer overflow checking to your own code.

    3. STOP_FORTH
      IT Angle

      Re: Language question

      Thanks Duncan and Kristian. Splitting extraction and decoding sounds like a good idea, except that each hex byte can take any value between 0 and 255. Won't that make tools that are expecting text choke? I'm not trying to turn the comments into Stack Exchange, honest!

      1. alisonken1

        Re: Language question

        Not really. In most programming languages, you can specify

        1) Extract the bits as hex binary, export as hex/decimal text

        2) Decode text input as hex/decimal as you prefer

        With Python, a text string can be checked to see if it's a valid number before processing, and I'm pretty sure most other programming languages have a built-in/library function that can validate text input as numeric values and convert from text to int as needed.

      2. MJB7

        Re: Language question

        Ah. You've got *binary* data. By "hex data", I think everyone assumed you meant you had a file full of digits and the letters 'A' to 'F'. Python, C#, and Java can certainly all handle binary data (and so can C of course). It is *still* worth writing a program to extract the data of interest, and then write it out again - and you can write it out in text format. In fact, it may be worth writing a program to take the whole thing and convert it to a text format (CSV, JSON, etc), and then write another program to take that text format and extract the data of interest, and then a final program to do the processing.

        The intermediate files are less efficient, but they make debugging *so* much easier.

        1. STOP_FORTH
          Unhappy

          Please stop everybody!

          You are quite right, it is binary data. Thanks to all of the replies, but I think my original question was very badly worded and has confused everybody.

          It is binary data that is often/usually not byte-aligned. Converting it to text/ASCII/whatever will result in some fields being spread across two or more bytes. Fields are not necessarily a multiple of whole bytes and may be less than a byte. Fields may be variable length and may also contain several different types of data. Bits may encode one or more binary flags, integers, pointers to end of field, ASCII, Unicode, several flavours of what can only be described as mangled ASCII, mangled ASCII encoded into ten bits (don't ask!), 33 and 9 bit clocks coding up one or more 42 bit clocks, variable length codes, CRCs, FEC codes and probably a lot of stuff I have forgotten.

          I am never going to shove this gallimaufry of data into awk or similar!

          Thanks again to all, I will slink away now.

          1. A.P. Veening Silver badge

            Re: Please stop everybody!

            No problem at all, as long as a defined set of fields together has a fixed/structured format or at least a type and length indication. Once you have that, automatic extraction of individual values is relatively straightforward.

            1. STOP_FORTH
              Happy

              Re: Please stop everybody!

              Structured but not fixed! Depending on the data, a field might be indicated with a flag and a length value, or a TLV (type-length-value).

              Anything optimised for text probably won't suit.

              1. A.P. Veening Silver badge

                Re: Please stop everybody!

                Anything optimised for text probably won't suit.

                Seems likely, but that is another discussion.

              2. Anonymous Coward
                Anonymous Coward

                Re: Please stop everybody!

                Might I recommend looking at Ruby, specifically pack and unpack.

                Lots of good examples of usage https://blog.bigbinary.com/2011/07/20/ruby-pack-unpack.html

                binary_data = "\x05\x00\x68\x65\x6c\x6c\x6f"

                length, message = binary_data.unpack("Sa*")

                # [5, "hello"]

                https://www.rubyguides.com/2017/01/read-binary-data/

                1. STOP_FORTH
                  Thumb Up

                  Re: Please stop everybody!

                  Thanks, I'll take a look at Ruby as well.

    4. Brewster's Angle Grinder Silver badge

      Re: Language question

      Does Forth not cut it? :P

      This is bread and butter stuff and I'd've thought any modern language will do what you want, if you understand how. So use the one you are most familiar with (does modern Fortran not do this?) as detailed knowledge of the language far outweighs the gain from shifting to one that makes it marginally easier. (I suspect knowing how to do this might be the real problem.)

      You can use 64 bit integers. But double precision floats can handle integers ≤ ±2^52 without loss of precision.

      If you really don't have a favourite language, and C++ is off the agenda, I'd be tempted to point at node.js. (Downvote away.) It has a crude, record-free approach that feels like a throwback to the 8 bit micro days. Load a file with const fs = require("fs"), buffer = fs.readFileSync("filename"); After that const text = buffer.toString("ascii", byteIndex1, byteIndex2); will retrieve an ASCII string between two byte indices. "ascii" can be changed to "utf8", "utf16le" and probably other well known encodings. If that's a hex value, turn it into a number with let value = parseInt(text, 16); The Buffer API lists methods that will allow you to peek at binary values by file index.

      Piece of piss.

      As others say, keep the data importing separate from processing.

      1. STOP_FORTH
        Happy

        Re: Language question

        I know FORTRAN has changed a lot but I really don't know how. Last time I used it we had batch processing and Hollerith cards (~1972-75). I have a lot more experience with BASIC and Assembler, but even that was 1980s-1990s. Although I have never really programmed in C, I am familiar with it as most of the technical specs I have read since about 1995 have consisted largely of pseudocode snippets which look a lot like C to me.

        Forth is actually not a bad suggestion, but it might be a little too low-level for this type of analysis.

        I'll have a look at node.js, I have heard of it but know nothing about it.

        I sat next to a UN climate scientist on a plane once. All of their programming is done in FORTRAN, probably a modern variant with lower case letters and numbers in the name!

        Thanks a lot.

    5. thames

      Re: Language question

      I would suggest using Python. If the data format is some sort of standard one in the scientific field, then chances are that there is already an open source Python library which imports it.

      If it's a custom format, then Python is pretty good at manipulating data. In particular have a look at using the "bytes" data type, as well as using the "struct" module.

      When you open the file, you are going to want to do line-by-line data processing or at least chunks of it, unless you happen to have enough RAM to hold it all at once. Python has a lot of options for this.

      I would suggest having a look at Pandas, which is a popular Python framework for handling large data sets.
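
      For the custom-format case, a rough sketch of chunked reading with the struct module (the record layout is invented, and it assumes the file holds a whole number of fixed-size records):

      import struct

      RECORD = struct.Struct("<Qd")   # e.g. a 64-bit id plus a double per record

      def read_records(path):
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(RECORD.size * 4096)   # a few thousand records at a time
                  if not chunk:
                      break
                  for rec_id, value in RECORD.iter_unpack(chunk):
                      yield rec_id, value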

      1. Doctor Syntax Silver badge

        Re: Language question

        " If the data format is some sort of standard one in the scientific field, then chances are that there is already an open source Python library which imports it."

        This is where we came in. There was open source Python code to do what the chemists wanted, except that it didn't.

  12. m4r35n357 Bronze badge

    Not a single mention of testing . . .

    Well that is your problem right there.

    1. Doctor Syntax Silver badge

      Re: Not a single mention of testing . . .

      There's no mention of a lack of testing either. The original authors might have tested the life out of it - on the H/W & S/W they possessed. That doesn't mean they could guarantee to catch all corner cases on all other possible platforms, including those yet to be designed.

  13. Cuddles

    Precision

    "For macOS Mavericks and Windows 10, the results were as expected (173.2); for Ubuntu 16 and macOS Mojave, the result differed (172.4 and 172.7 respectively). They may not sound a lot but in the precise world of scientific research, that's a lot."

    Depends which bit of the scientific world you're in. In the part I inhabit, being within an order of magnitude is often considered a good result. Having a variation of less than 0.5% is the sort of thing we can only dream of.

    1. Francis Boyle Silver badge

      Let's see

      According to XKCD#2205 you are a cosmologist.

      1. A.P. Veening Silver badge

        Re: Let's see

        And now as a link to make it easier for the rest.

    2. gnasher729 Silver badge

      Re: Precision

      Double precision has about 16 decimal digits of precision. That's plenty even if your code manages to lose six digits to rounding errors. But if your rounding error is 0.5%, your results are dangerously close to being completely unusable: you've somehow lost almost 14 of those 16 digits, and losing all of them would be absolutely fatal.

  14. Cem Ayin

    Possibly OT: What the user really wanted...

    Sloppy coding, neglecting to RTFM - it's business as usual in scientific research land XD (and you can't even blame the perpetrators, given that they have not normally been properly trained to code, even if coding happens to become their main task on the job; Luke 23:34 fully applies).

    But apart from the obvious fault I vaguely smell another issue here, which is all too common in scientific computing although the core of the problem is rarely fully appreciated: the lack of indexed files in modern OS'es. Yeah, I know, ISAM, RMS and the like are sooo seventies (at best), who on earth needs such a thing in this day and age? Nobody, right? Except, of course, when they do, such as scientists wanting to store data that does not fit into main memory when individual chunks of the data need to be accessed by a string-typed key, a problem which is quite common of course.

    Oh, yes, proper solutions for this problem abound, from the various dbm-derived key-value-stores around right down to HDF5, sure enough. Only, these libraries are not part of any standard API on POSIX-ish operating systems, their installation often needs to be requested through more or less official and less or more slow-to-respond channels or developers need to bundle their own copy with their code, plus each of these tools comes with a learning curve of its own - aye, there's the rub!

    Now what I daresay 8 out of 10 scientists REALLY do is this: they create a directory on a POSIX file system and within this directory they create a file /for each record/ of data, so they have easy access to each of them. The whole seven million or so. Never mind that this ingenious solution will bog down even the most performant parallel file systems when scaled even to medium size and stubbornly resist any attempts to back this mess up in finite time. That's the problem of the IT guys, right...? (Or rather, more to the point, the users in question are not normally aware of any gotcha lurking there. After all, what could possibly go wrong?)

    And all the user really wanted was a modern equivalent of an ISAM file in his OS' standard API...

    DISCLAIMER: I have not looked deeper into the library discussed in the article so there might really be a good reason why they store data the way they do and my comment /may/ indeed be OT here. But the problem I describe is real, and I know I'm not the only one fighting with it on a regular basis.
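
    For what it's worth, the nearest thing to an ISAM file in the Python standard library is probably the dbm/shelve pair; a minimal sketch (file name and record contents invented):

    import shelve

    # A persistent, string-keyed store backed by a dbm file on disk;
    # values are pickled, so arbitrary Python objects can be stored per key.
    with shelve.open("results_store") as store:
        store["sample_0001"] = {"shift": 173.2, "files": 7}
        print(store["sample_0001"])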

    1. Martin Gregorie

      Re: Possibly OT: What the user really wanted...

      Easy-peasy solution: use any relational database. To do what you're asking only needs a single table with a primary key. The software is free too: anything from MariaDB, Derby or SQLite to PostgreSQL.

      The latter would be my choice: dead easy to install, once set up needs next to no care & attention and, unlike the others I mentioned, would be easy to use as a shared resource for several projects.

      1. ibmalone

        Re: Possibly OT: What the user really wanted...

        At the risk of making people wince, HDF5 is not a million miles from a database format, and designed for this kind of use (mainly the vast dataset access). It's also designed for supercomputer use (if that's relevant to the task), while traditional databases are built around a server model, which may be much more troublesome for high performance tasks (at the big end) or processing stuff on your laptop (at the little end). A database is a good place to park data, we do actually store analysis results in a RDBMS, but we don't store or process images in a traditional database because it's not a suitable way to work with them. (Or in HDF5 normally for that matter, interoperability with existing tools is an important consideration.)

        That said, HDF5 still doesn't solve all your problems of arranging and structuring your data for you, or the endless task of collecting, collating and converting data, and interfacing tools that want to work with different formats or fundamentally different understandings of the thing you're analysing.
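
        For anyone curious, the h5py bindings (a third-party package) make the basic usage fairly painless; the dataset and attribute names below are invented:

        import numpy as np
        import h5py

        volume = np.random.rand(128, 128, 64)
        with h5py.File("scan_001.h5", "w") as f:
            dset = f.create_dataset("t1/volume", data=volume, compression="gzip")
            dset.attrs["subject"] = "demo"

        with h5py.File("scan_001.h5", "r") as f:
            slab = f["t1/volume"][:, :, 10]   # reads just one slice from disk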

  15. Alan Bourke

    Not really Python's fault then.

    Was it.

  16. shufflingB

    The "but it works on my machine" problem, gimme your container

    As anyone who has worked with "build systems" will know, this is just another instance of the "but it works on my machine" problem. Software has a life of its own, in most places reproducibility is poorly understood and a low priority. Right up until a critical bug is discovered by a customer that needs fixing years after release, or if really unlucky, it ends up killing someone.

    The fact of the matter is that the standard platform tools available to devs, be they brew/apt/pip/cpan/npm/rpm/gem etc etc, prioritise ease of use and are all dependent on the unknown quality and consistency of the artefact recipes they use. Aside from that, none of the tools stop the dev's artefacts from being reliant on undocumented features or tools on their system (set PATH=${PATH}:~/joeblogs/bin anyone?).

    It doesn't matter what type of devs we're talking about. Capturing dependencies for the creation of any software artefact that is close to 100% reliably reproducible is really, really, really difficult, time consuming and error prone. To think otherwise is happy-path delusional.

    For this reason, long term support of safety critical SW artefacts often involves sticking production machines in a cupboard (or the virtualised equivalent thereof). Testing's part of the solution, but testing is finite and as fallible as the rest of it. Personally, given science's reproducibility crisis and computationally assisted mathematical proofs, I would think supplying containers/images should be a mandatory part of the publication process.

  17. Paddy
    Facepalm

    Onward and Upward!

    They have moved on from the problems of a couple of decades ago when C and Fortran ruled.

    Why did you expect that allocated memory to be all zeros, again?

    You only ever used 5 letter names, which dumbo would have tried to use more than 10?

    Are you sure <insert complex pointer arithmetic> does what you think it should?

    Progress, yah!

  18. hellwig

    Don't use kids toys for adult sciencing

    The scripts, described in a 2014 Nature Protocols article, were designed by Patrick Willoughby, Matthew Jansma, and Thomas Hoye to handle nuclear magnetic resonance spectroscopy (NMR), a process for assessing local magnetic fields around atomic nuclei.

    Ah Python, your simplicity bites another sucker in the ass. I can see the program now:

    import science

    science.run(".")

    I love how the science can't be done without computers, but they didn't really bother to learn much about said computers. If nothing else, understand that something written on Linux or BSD that involves the file system will most likely not be directly portable to Windows. Heck, I know when I look over the Perl documentation, it makes this very clear.

    1. ibmalone

      Re: Don't use kids toys for adult sciencing

      Python docs, first sentence: "The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order."

      Seems relatively clear. Python is used as a teaching language, but that doesn't make it a toy, or unsuitable for actual work. (From someone who writes mostly in Perl...)

      Anyway, the arcana of system implementations should not be required reading to understand the biochemistry of cyanobacteria, if people need to know that stuff while performing increasingly sophisticated domain-specific tasks then there's something wrong. And I don't think the situation described really makes that point either; it's a more general issue about checking the files being read in, which could be screwed up by all kinds of things.

  19. disgruntled yank

    Also

    What did the function add above a direct call to glob.glob?

  20. arctic_haze

    Things like that do happen

    I had a similar difference in results many years ago when I ran a Monte Carlo code written in C++ on Windows and Linux. The culprit turned out to be the random number generator. It was a lesson I learned, and since then I've been using an OS-independent random number generator function embedded in the code.
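
    In Python the standard random module is already the same Mersenne Twister on every platform, but the embed-your-own generator the poster describes can be a few lines; a sketch using the well-known Numerical Recipes LCG constants, purely as an illustration of determinism rather than a recommendation for serious Monte Carlo work:

    class PortableLCG:
        """Tiny linear congruential generator: identical sequence on every platform."""
        def __init__(self, seed=1):
            self.state = seed & 0xFFFFFFFF

        def next_uniform(self):
            # Numerical Recipes constants, 32-bit modulus.
            self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
            return self.state / 2**32   # uniform in [0, 1)

    rng = PortableLCG(seed=42)
    print([round(rng.next_uniform(), 6) for _ in range(3)])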

    1. Paul Kinsler

      Re: The culprit turned out to be the random number generator.

      I had a case once where some stochastic simulations depended for their evolution on the square root of a field. Occasionally, during convergence checks, trajectories would diverge violently from each other, and sometimes therefore show up as suspicious differences in the averages or variances (particularly because during convergence checks I wasn't using large ensemble sizes).

      Turned out that sometimes tiny differences between the integrations in the two cases put the next values on opposite sides of the square root's branch cut, meaning that the next random kicks went in opposite directions. That one took quite a while to track down, especially since it was both rare and intermittent.

  21. Anonymous Coward
    Anonymous Coward

    Assumptions (again)

    for file in glob.glob('*.out'):

    list_of_files.append(file)

    *

    Shouldn't this look like:

    for file in glob.glob('*.out'):

    list_of_files.append(file) # Assumes the returned list is in <assumption here> order

    *

    This at least would give users (and testers!) something to work with.

    *

    Yet another warning about programming and assumptions (if we needed another one!).

  22. Ken Moorhouse Silver badge

    Surely the solution is an explicitly defined index...

    A list which constrains the order of processing?

    This should not be news to anyone familiar with SQL: the engine optimises the order in which processing is carried out, and unless an explicit ordering is specified the output order is entirely at the discretion of the engine.

  23. David Gillies

    Don't go exploring in academic codebases without a pump-action shotgun and a torch

    The overall quality of code in academic STEM disciplines is hair-raisingly awful. I used to get called in on occasion to help out a colleague with a particularly refractory bit of code. I think my greatest win was applying a bit of algorithmics and dropping the complexity of a routine from quintic to linearithmic, which in the problem space where it was running represented a five-orders-of-magnitude speed-up (a month down to 25 seconds).

  24. Anonymous Coward
    Anonymous Coward

    Mission creep

    I suspect part of the problem here is mission creep - it certainly happens in my environment.

    A researcher knocks together a bit of code to run a particular experiment on their computer, or a proof-of-concept program.

    Being a quick hack it's written on a 'works for me' principle for the particular O/S and carefully curated file set the researcher has, without considering portability or too much error checking.

    The manager sees it, says 'that's cool, but can it do this too?' and before you know it your quick hack has grown into something vital for the lab., gets used for data preparation for journals and gets published 'as is'.

    Of course, when it turned from being a quick demonstrator to a piece of vital infrastructure it should have been rewritten from the ground up, but there's always pressure to do science too and, what's that? A professional developer wants to be paid to do it?

    1. Doctor Syntax Silver badge

      Re: Mission creep

      A glimpse of reality. This is how stuff happens. The first and maybe only version of the code is written for a single, constrained situation. My first FORTRAN program (the initial attraction was "You mean I don't have to do my own arithmetic any more?!") was to take readings from levelling (did you know there were surveyors staffs calibrated in decimal feet) and metric measurements from the Hiller borer and print out lists of numbers for me to draw stratigraphic sections by hand. Not too many people were going to want that one.

      I admit to a certain amount of conflict over this one.

      As a sometime scientist the computer was to be used as a tool like any other instrument such as a microscope. However you should know your tools, e.g. if you're a microscopist setting up the Kohler illumination should come naturally to you. If you're going to write programs to do a job you should take care over them.

      OTOH a high level programming language should provide a high level abstraction of the platform and it might be reasonable to expect the provision of a view of directory contents consistent from platform to platform to be a part of that abstraction.

  25. martinusher Silver badge

    Numerical calculations have limits

    If you need to perform calculations reliably -- that is, without encountering problems with overflow or underflow or losing precision due to rounding errors -- then the only way to do this is with BCD arithmetic. This effectively treats numbers as strings and allows for arbitrarily high precision. There is a significant performance penalty, so the best use for this would be to run quality control tests on more optimized code. A big part of designing numerical calculations is understanding and controlling overflow, underflow and rounding -- rounding in particular causes all sorts of precision problems.

    I should take issue with the notion that "different OSes cause errors". The operating system has absolutely nothing to do with the behavior of a calculation. It's the libraries attached to the language (which in turn might use libraries for system languages like 'C') that dictate how calculations are performed and how the underlying hardware is used. These libraries could -- should -- be replaced with ones that are purpose designed for the level of precision required.
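
    Python's decimal module is one readily available way to do the kind of arbitrary-precision decimal arithmetic described above (the precision value is arbitrary); note, as the reply below points out, it still only approximates irrational results:

    from decimal import Decimal, getcontext

    getcontext().prec = 50                  # 50 significant decimal digits
    print(Decimal("0.1") + Decimal("0.2"))  # exactly 0.3, unlike binary floats
    print(Decimal(2).sqrt())                # still an approximation, just a finer one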

    1. ibmalone

      Re: Numerical calculations have limits

      The real numbers beg to differ.

      BCD cannot help you with cosine(1) or sqrt(2).

  26. Anonymous Coward
    Anonymous Coward

    Computer scientists not immune...

    A few years back I joined a Python project developed by two masters of computer science. Not long after I was off troubleshooting a user with an issue that turned out to be due to an unsorted dictionary - as any Python 3.5 user should know, dictionary order is not guaranteed. Since the masters couldn't reproduce it - unit tests and all - they were completely confident of PEBCAK. Even once I explained the problem and fix I think only one of them actually understood, the other was too irritated that I was impugning his code to believe either me or the docs.

    In the end I 'just' migrated the whole org to Python 3.6 - it was only a couple hundred users and I wanted some other 3.6 feature anyway, so the happy luck that dictionaries stay in insertion order for >=3.6 was convenient. Both masters left of their own accord shortly after I became the manager - as a flawful programmer myself I don't hold mistakes against people but seriously, show some humility.

    (anon to protect the guilty, not that I think our situation was particularly unique)
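
    A minimal sketch of the defensive version of that kind of code (values invented): if the logic needs a particular order, spell it out rather than relying on how the interpreter happens to store the dict:

    results = {"b_run": 2.0, "a_run": 1.0, "c_run": 3.0}

    # Iteration order was arbitrary before Python 3.6, and leaving it implicit is
    # fragile in spirit everywhere; sort by key (or whatever order the analysis needs).
    for name, value in sorted(results.items()):
        print(name, value)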

    Beyond that, the thread leaves me a few thoughts:

    - you lot who have problems developing large projects in Python, you are doing it wrong. Apply decent software development principles and it is no more complicated to create, maintain, and support than a well-developed project in any other language.

    - just because untrained devs can use Python, doesn't mean you must be an untrained dev to use Python. If it's the right tool for the job, use it, and if it's not then use something else. Write the code as best and as organized as you can regardless of tool, you and/or your replacements will be thankful (in between cursing your youthful ignorance).

    - I love testing, but the fact is that tests are usually run in a particular kind of sandbox - you aren't likely to get different results due to memory constraints, or file sorting errors, or dictionary insertion, or any other apparent randomness that will affect users. A good tester can deliberately introduce these conditions, but it costs a lot and in my experience schedules truncate it most of the time.

    - Finally, is there some kind of clearinghouse for research software? Strikes me (non-researcher) as odd that a formal paper was published for a relatively minor and to be honest fairly obvious code bug, and then picked up by media - seems like the kind of thing I use an issue tracker to manage. Is that how researchers collaborate on software normally?

    1. Doctor Syntax Silver badge

      Re: Computer scientists not immune...

      "- just because untrained devs can use Python, doesn't mean you must be an untrained dev to use Python. If it's the right tool for the job, use it, and if it's not then use something else."

      If untrained devs are expected to use Python then maybe Python shouldn't bite. And don't expect them to know if it's the right tool for the job or what the right tool might be.

      1. A.P. Veening Silver badge

        Re: Computer scientists not immune...

        If untrained devs are expected to use Python then maybe Python shouldn't bite.

        I think it is time to develop a new programming language and name it Cobra or Black Mamba, just as a warning it will bite if handled improperly.

        1. ibmalone

          Re: Computer scientists not immune...

          Imperative Execution Description, IED.

        2. Doctor Syntax Silver badge

          Re: Computer scientists not immune...

          Nice to know somebody noticed.

        3. Ken Moorhouse Silver badge

          Re: name it Cobra or Black Mamba

          Adder would be more appropriate, it being humbler regarding its likely limitations.

          1. A.P. Veening Silver badge
            Devil

            Re: name it Cobra or Black Mamba

            There will be no limitations regarding poisoned output

  27. Bitsminer Silver badge

    Best science software request

    Back in the days of HP coloured pen plotters, one day a nuclear physicist asked me for a pointer to a "double precision plotter library". Seems he was having trouble with expectations vs reality.

    After thinking quietly to myself about a suitable response to a future request for a "double precision pen plotter", I asked him what formula he might be trying to get a paper graph of. Then suggested he rephrase his problem in the form of a Taylor series. He got the idea immediately. Two problems solved! No software was libelled in the telling of this story!

    1. ibmalone

      Re: Best science software request

      I suppose the problem would be where to get the (glances at back of envelope) A-12 size paper.

      1. This post has been deleted by its author

        1. Simon Harris

          Re: Best science software request

          Aminus12, on the other hand, might pose other logistical problems, being about 76x54m in size.

          However, even at this size, even a 24-bit mantissa would give a resolution of about 4 microns over the whole area without loss of precision.

          My finest drawing pen has about a 100 micron tip.

          1. ibmalone

            Re: Best science software request

            Hmm, I think my problem was in converting to hypothetical DIN A sizes. Taking ~1/16mm precision, and A4 being ~2^8mm across, that's 12 bits for a location in A4, so it needs to be 2^(24-12) times as big per side for it to be worth using something more than a 32 bit float (24 bit mantissa). So I just subtracted 12 from 4, but this was a mistake, since it's only one side that doubles at a time with A sizes, so I should have subtracted 2*12 from 4: A-20.

            (On closer investigation there are two 'negative' DIN A sizes, 2A0 and 4A0; each represents a doubling in one direction, 2A0 is what you'd logically imagine A-1 to be, and 4A0 would be A-2. Rather boringly then, following that logic, the plotter would need 2^8A0.)

      2. This post has been deleted by its author

  28. This post has been deleted by its author

  29. razorfishsl

    PFFFFFFFFF.

    Biologists think they are good coders purely because they deal with DNA.....

    Far too many so-called "programmers" these days are loose cannons as regards the code they write.

    I have even seen people who are completely unskilled in data storage & computer programming instruct programmers on how to handle data manipulation:

    because

    1."computer code & programming is just like English"

    2. we will never want to do that with the data, so just "strip the type" and store it as a string.....

    1. fajensen
      Facepalm

      2. we will never want to do that with the data, so just "strip the type" and store it as a string...

      Often experienced 'in the wild' within code that uses double-barrelled abominations: MySQL and Java!

      On top of that pile there will be some kind of 'string-to-object' mapper/converter library with a side dish of functions re-implementing most of the search, selection and ordering logic in an OO-kosher way (that the Database (even MySQL) could also do with SQL - if only it knew the meaning of the data and OO-Purity was not enforced)!

      ----

      MySQL didn't really use nor enforce data types in 'the early years', leading to much mischief by developers that should have used a better database engine instead of hacking around MySQL flaws and quirks.

  30. jjq

    same old mistakes

    As already noted, the quality of software in the scientific community is generally lamentable; little or no design and poor execution. In this regard, Les Hatton's ``The T Experiments: Errors In Scientific Software'' is worth a read: https://www.leshatton.org/IEEE_CSE_297.html . My own scratchings include https://books.google.com/books?id=OOgBQ97VrpIC&pg=PA3&lpg=PA3 but the text would need to be read to appreciate why I'm not engaging in self promotion.

    Last week's Brexit meeting between Johnson and Varadkar, at Thornton Manor, took place five miles from where I watched the moon landing, aged 7. Aged 29, I made it to NASA Langley where Armstrong practiced his moon landings; with a fresh PhD in Aeronautics and a large piece of software I'd written for my thesis. That was 1991. Fast forward to today, with the intervening rise of ``computational science'', the situation for many grad students/post-docs is dire. For they are expected to work with large, legacy codes that are often simply not fit for purpose.

    Thus while the present article is to be welcomed, it falls at the superficial end of the spectrum in terms of highlighting computational problems in the research world.

  31. matthewdjb

    Order!

    Unless the type of the returning parameter explicitly specifies the sort order, the sort order of the returned data is undefined.

    A very important basic fact.

    I am astonished at the number of people who think that an SQL SELECT will return records in the primary key order. In a relational database, the sort order is undefined.

    I don’t know, apathetic bloody chemists, I’ve no sympathy at all.

    1. ibmalone

      Re: Order!

      I don’t know, apathetic bloody chemists, I’ve no sympathy at all.

      I don't think that's core python, maybe an add-on.

  32. JBowler

    LC_COLLATE

    https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html

    This is a problem I have encountered several times in the past while doing data processing of large data sets contained in multiple files; it's a common scenario in many activities where data is collected over time and then analysed later.

    Yes, the algorithms which analyse the data most certainly should not produce **significantly** different results depending on the order the input data is processed, but they always do produce **different** results because of rounding errors in the floating point arithmetic that is used. Forcing a sort order is really just hiding an underlying problem and, given that the things being sorted are textual names of files, it should be apparent (with a little thought) that the order is going to be language specific:

    https://docs.python.org/3/howto/unicode.html#unicode-filenames

    https://docs.python.org/3/library/os.html#os.listdir

    What needs to happen is one of two things:

    1) The data files are themselves ordered. Then there should be a separate file listing the order, and that file should be read to find the names of the files with the data and (implicitly) the order in which to read them. An alternative is to encode the order in the file name, but that should be documented in both the code and, textually, in separate instructions for people who add to the data. I routinely use ISO dates or dates/times to do this (e.g. 20191018, 20191018.1754 etc.)

    2) The data files are not ordered. Then the code should be tested with data files in different orders. The way I do this is to randomise the read order so that every run reads in a different order. It's immediately obvious then if there is an instability or bug in the code!
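
    A minimal sketch of that randomised-order check (the glob pattern is just an example):

    import glob
    import random

    files = glob.glob("*.out")
    random.shuffle(files)   # deliberately scramble the order on every run
    # ... feed `files` to the analysis; if the answer changes from run to run,
    # the pipeline is order-dependent and needs fixing, not just sorting.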

    In both cases scientists should always produce error calculations. Sadly very few do. There are two ways of doing this:

    1) Regular error analysis. I was taught this the first year in university; the Physics department felt it was a lot more important to teach generally applicable scientific methods than any physics.

    2) Interval arithmetic. This is particularly appropriate to deal with the errors introduced by floating point rounding in computer systems:

    https://en.wikipedia.org/wiki/Interval_arithmetic

    https://pypi.org/project/pyinterval/

    https://arxiv.org/abs/1611.09567

    For science either can be used but interval arithmetic deals with unstable calculations better; you tend to end up with an interval containing an infinity or a NaN, which makes the problem very obvious.

    John Bowler
