Facebook boffins bake robo-code converter to take the pain out of shifting between C++, Java, Python

Facebook researchers have applied recent advances they've made in the unsupervised machine translation of human languages to a source code conversion system. In a research paper recently distributed through ArXiv, boffins at the beleaguered ad biz describe a project called TransCoder. TransCoder is a transpiler, also known as …

  1. martinusher Silver badge

    Conversion costs?

    The way this is written implies that the $750 million spent converting a code base from COBOL to Java was merely line-by-line translation of the source. I don't have any knowledge of this project, but based on my experience the translation would be taken as an opportunity to upgrade the system and its functionality. That is, the 'conversion' was really a system rewrite, with all the complexity in design, coding, integration and testing that would entail.

    1. coconuthead

      Re: Conversion costs?

      Indeed, my observation as a customer of the bank is that the systems have gone from processing a lot of stuff at night for availability the next day to real-time. Their marketing now emphasises some of these features.

      There also used to be many Byzantine procedures which required interaction with a human at the bank, e.g. by phone. These days they only seem to want to talk to you to prevent fraud or overextension (which they are legally required to do), or to otherwise satisfy regulation. Given the size of the bank, it may have been rational on staff costs alone to get the conversion done sooner rather than later, and cheaply.

    2. J27

      Re: Conversion costs?

      The author is clearly not familiar with the process of porting code between languages/frameworks.

  2. Mike Shepherd

    "faster and...more maintainable"

    TransCoder...could help port a project from Python to C++...It may make the code faster and also more maintainable since code written in strongly-typed languages can be easier to understand.

    Get real. Anyone who's maintained human-written code will have experienced wanting to shake its author by the throat, because typical source code is of abysmal quality. We're a long way from any machine that will improve on that. So don't ask me to work on TransCoder's output, because it will have to do a lot more than change x=2 to int x=2 (even if it gets that right) to compensate for the general mess with which it will likely start and to which it can only add.
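
    Mike's x=2 example actually understates the problem: in Python the same name can hold values of different types on different paths, so a translator has to reason about every use of a name, not just its first assignment. A toy sketch (the function is invented purely for illustration):

```python
# Hypothetical illustration: why "x = 2" -> "int x = 2" is not a safe rule.
# A Python name can be rebound to a different type, so a transpiler must
# infer a type for every *use* of the name, not just the first binding.

def describe(flag):
    """The name 'result' holds an int on one path and a str on the other."""
    result = 2            # a naive rule would emit: int result = 2;
    if flag:
        result = "two"    # ...but here the same name holds a string
    return result

print(describe(False))  # 2
print(describe(True))   # two
```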

    As for speed, there has been very little code worth speeding up for the last 40 years or more. The overwhelming problem, then and still now, is how to write clear and reliable source that reflects the requirements accurately. Making code faster is appropriate in niche cases, which most of us rarely need to address.

    1. Def Silver badge

      Re: "faster and...more maintainable"

      I don't think I can name a single application I've ever used (or a lot I've worked on, for that matter) that wouldn't have been better if it had been faster.

      Making code faster is only becoming a niche as higher-level languages take more control away from the programmer, and as fewer and fewer programmers really understand the performance consequences of their choices or know how to optimise their code.

      1. RM Myers Silver badge

        Re: "faster and...more maintainable"

        Wow, two comments that I can both agree and disagree with at the same time. Having designed and helped write, almost 40 years ago, a series of programs that saved over $300K per year in processing time for probably less than $30K in programming time, I definitely disagree with the "very little code worth speeding up in the last 40 years" comment. Plus, having been involved with systems that had thousands of users at my former employer, I can tell you performance can be critical. We had several large projects that were never implemented because the systems were too slow.

        At the same time, there were many times where performance was much less important than maintainability, both from a quality perspective and overall cost. When a limited number of management people are making long term billion dollar decisions based on reports and analysis from your system, being fast is much less important than being accurate. And senior management doesn't tend to like it when small changes take months to implement, so lack of maintainability can be career threatening.


        1. Def Silver badge

          Re: "faster and...more maintainable"

          I never said anything about sacrificing maintainability for performance.

        2. big_D Silver badge

          Re: "faster and...more maintainable"

          Except optimization and readability / maintainability are not mutually exclusive. You can optimize maintainable code and still have it be maintainable.

      2. Rafael #872397

        Re: "wouldn't have been better if they'd been faster"

        Back in the Old Days, I worked on developing dBase/Clipper applications in DOS (ask your grampa) for a local retail shop. We also had a Turbo Pascal program that printed some random numbers, a simulated countdown and a message "Don't Stop Me - Reindexing and Optimizing the Database".

        Of course it was just what we used for a "coffee"-break. And we worked hard on making it slower.

        1. Anonymous Coward

          Re: "wouldn't have been better if they'd been faster"

          You want to admit to that?

          I bet you also played on a CP/M machine too.

    2. big_D Silver badge

      Re: "faster and...more maintainable"

      Niches, such as desktop computing or web applications... Yes, very little need for optimization there.

      As someone who spent a lot of their programming career firefighting poorly optimized code, I can tell you that code optimization is very important, and that optimizing code and writing human-readable code are not mutually exclusive; it's just that human-readable doesn't automatically make it fast to execute.

      I've worked on desktop projects that, for example, reduced the runtime of a financial data collection system from 22 hours to 2 hours on a PC. Multiply that up by hundreds of accountants around the world, add the fact that it locked up the whole PC for those 22 hours so they couldn't do anything else, and the time invested in optimizing paid for itself within a month.

      Likewise, I've worked on web projects where the load-balanced servers and back-end database would collapse under the load of 250 simultaneous transactions; after a few hours of optimization, the same server configuration didn't break a sweat with over 1,000 simultaneous transactions. Those few hours of work were a lot cheaper than throwing four times the hardware at the problem.

      I can give dozens of other examples.

  3. HildyJ Silver badge

    Based on language translation

    It's one thing if you translate "I seem to be having this tremendous difficulty with my lifestyle" into Vl'Hurg and it comes out as the most dreadful insult imaginable.

    But if you're dealing with code, less than 100% accurate translation will cause errors which may be catastrophic and will be difficult to track down. Having a neural network do this instead of programmers just makes it faster to generate errors.

    1. Anonymous Coward

      Re: Based on language translation

      "The generated functions and production code have to be tested; they are not guaranteed to be correct."

      Hey, you did have tests for the old code, that you can run against the new code, right? Cuz otherwise who can tell if it's code... or garbage?

      But maybe the new code won't be testable if "... accuracy, with 74.8 per cent and 68.7 per cent in the C++ → Java and Java → Python directions ...". That sounds more like "Here's what the de-obfuscator produced, now make it golden. Only a couple of minutes per function, right? All done today?"

      Yet elsewhere in the article there's a wonderful bit saying "... that accurately translates functions ...". How do you get that from the claimed 74.8% "not trash"? Or, as I read it, one quarter crap.
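
      The "74.8 per cent" figure invites exactly the spot check this comment describes. A minimal differential-testing sketch, with both functions invented as stand-ins (the "translated" one here is correct by construction, so no divergence is found; a real translated function might not be so lucky):

```python
# Compare an original function and its machine-translated counterpart on
# random inputs. Agreement on a sample is evidence, not proof, that the
# translation landed in the "behaves correctly" bucket.
import random

def original_gcd(a, b):
    while b:
        a, b = b, a % b
    return a

def translated_gcd(a, b):  # stand-in for a transpiler's output
    while b != 0:
        a, b = b, a % b
    return a

def differs(f, g, trials=1000, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    for _ in range(trials):
        a, b = rng.randint(0, 10**6), rng.randint(0, 10**6)
        if f(a, b) != g(a, b):
            return True   # found a divergence
    return False          # no divergence observed on this sample

print(differs(original_gcd, translated_gcd))  # False
```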

      1. Warm Braw Silver badge

        Re: Based on language translation

        The generated functions and production code have to be tested; they are not guaranteed to be correct.

        That's not generally a characteristic you would welcome in a compiler. It's not as if there isn't open source code available* for parsing and lexically analysing COBOL and Python, so you just need to glue a code generator on the back end for the target language. You might not get a result that you can visually associate with the original (though you could put that in comments), but at least it would be functionally correct^.

        I'm not quite sure how an enormous amount of effort to produce a flawed AI solution + unknown effort to correct the result actually saves time and money. Particularly when elderly code has a habit of working in mysterious and undocumented ways - usually the main reason it is preserved.

        Edit: And given the world is supposedly moving towards container-based microservices, why convert stuff anyway if it's working?

        *With the exception of proprietary language dialects.

        ^Assuming conforming data types for source and target
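
        The deterministic parser-plus-code-generator approach Warm Braw describes can be shown in miniature with Python's own ast module. This toy handles only arithmetic expressions and, unlike a neural model, refuses rather than guesses when it meets something outside its rules (all names here are invented for illustration):

```python
# Toy transpiler back end: walk a parsed Python expression tree and emit
# a C-like expression string for a tiny arithmetic subset.
import ast

_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_c_expr(source):
    """Translate a Python arithmetic expression to C-like syntax."""
    node = ast.parse(source, mode="eval").body
    return _emit(node)

def _emit(node):
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return f"({_emit(node.left)} {_OPS[type(node.op)]} {_emit(node.right)})"
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return repr(node.value)
    if isinstance(node, ast.Name):
        return node.id
    # Anything outside the subset is an error, never a guess.
    raise NotImplementedError(f"no rule for {type(node).__name__}")

print(to_c_expr("a * 2 + b"))  # ((a * 2) + b)
```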

        1. thames

          Re: Based on language translation

          As I understand it the main interest in converting from COBOL to a more recent language is due to the limited and shrinking pool of COBOL programmers available.

          The old code can be kept running, but as a business changes and evolves, the software must evolve along with it. It's hard to find programmers who want to learn COBOL, because specialising in that field limits them to maintaining old COBOL programs, and that sort of work can be feast or famine.

          The interest in converting from Java to some other language (e.g. Python) is driven by a determination by some companies to get as far away from Oracle as possible. It's all down to Oracle being Oracle, rather than anything about the language itself.

          In either case if the original program had a really good set of unit tests, then you could run the original source code through a translator, run the unit tests, fix what fails, and you would have a working system. You could then gradually re-write the machine translated code to something more understandable by humans to make it more maintainable.

          Unfortunately, those sorts of programs rarely if ever have any unit tests at all, let alone have comprehensive ones. That means the first thing you would need to do is analyse the original code and write unit tests manually. And if you're going to do that, you may as well just re-write the source code by hand while you're at it.
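
          The "write the tests first" step thames describes is usually done as characterization (golden-master) testing: record what the legacy code actually does, then hold the translation to it. A small sketch, with legacy_rate invented as a stand-in for an untested legacy routine:

```python
# Record a legacy function's observed behaviour as a golden file, then
# run the same cases against any replacement (e.g. a translated version).
import json
import os
import tempfile

def legacy_rate(amount):  # stand-in for an untested legacy routine
    return round(amount * 0.07, 2)

def record_golden(func, cases, path):
    """Capture func's output for each case as the reference behaviour."""
    golden = {str(c): func(c) for c in cases}
    with open(path, "w") as f:
        json.dump(golden, f)

def check_against_golden(func, path):
    """True if func reproduces every recorded input/output pair."""
    with open(path) as f:
        golden = json.load(f)
    return all(func(float(k)) == v for k, v in golden.items())

path = os.path.join(tempfile.gettempdir(), "golden_rates.json")
record_golden(legacy_rate, [0.0, 10.0, 99.99], path)
print(check_against_golden(legacy_rate, path))  # True
```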

          1. Anonymous Coward
            Anonymous Coward

            Re: Based on language translation

            "... run the unit tests, fix what fails, and you would have a working system."

            It's a nice idea, but unit tests (specifically, as opposed to other levels of testing) tend to be written in the same language as the actual code. Which means you also need to convert the tests - if they remain in the original language then they probably can't exercise the new code as older systems tend to be less flexible in terms of what they can call.

            So how do you test that the tests have been converted okay? (Recursion! Who watches the watchers?)

  4. Anonymous Coward

    Can’t wait

    Looking forward to the logical endpoint of this work: a ML system which, shown images of a language manual and an ISA manual, generates a compiler.

    1. Anonymous Coward

      Re: Can’t wait

      If the manual contains an image of the language syntax in BNF or similar, we're nearly there (see Bison, ANTLR, etc.) ...
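
      For a sense of how mechanical the grammar-to-parser step is, here is a hand-written recursive-descent parser for a two-rule BNF (expr ::= term ("+" term)*, term ::= NUMBER). Bison and ANTLR generate this kind of code from the grammar automatically; the functions below are invented for illustration:

```python
# Recursive-descent parser/evaluator for: expr ::= term ("+" term)*
#                                          term ::= NUMBER
def parse_expr(tokens):
    """Parse and evaluate an expression; returns (value, remaining tokens)."""
    value, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        rhs, rest = parse_term(rest[1:])  # consume "+" and the next term
        value += rhs
    return value, rest

def parse_term(tokens):
    """term ::= NUMBER - consume one numeric token."""
    return int(tokens[0]), tokens[1:]

print(parse_expr("1 + 2 + 3".split())[0])  # 6
```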

  5. Lunatic Looking For Asylum


    Struggling to see why you would need something like this. If the code is good and performant, why would you want to convert it? If it's crap, then what comes out will be crap as well, and you may as well revisit the design and do it better (note I didn't say properly) in your language of choice.

  6. thames

    Cython, Numba, etc.

    If the objective is performance, there are already multiple solutions for converting Python source code to C or C++. Some are domain specific and intended to just convert a few functions, but some like Cython will do the entire program (to C in this case). Cython has been around for years and is widely used.

    The thing is that there is little or no real demand for converting Python to C or C++ and then throwing away the Python and using the C or C++ as the new code base. Things like Cython are used by people who want to write code in Python and compile to machine code. The C or C++ are just used as a form of intermediate code in the compilation step, and the user doesn't normally look at it.

    In certain particular applications, the resulting binary is faster than the Python version. However, in other applications, the C or C++ actually runs slower than the Python version. This is why the technique has specific applications (such as numerical algorithms) instead of everyone using it on everything.

    The subject of converting Python to C or C++ has been very thoroughly researched by multiple people, and the conclusion that everyone seems to reach is that while it can work in some cases, in order for a static language such as C or C++ to be able to do everything that a dynamic language such as Python can do, it would have to incorporate the equivalent of a Python interpreter inside it.

    And indeed this is what Cython actually does. It does what it can in C, but it also calls back into the Python interpreter to do many things. The overhead involved in this means that the translated C program can be slower than the interpreted Python program.

    The way these tend to get used in practice is that people benchmark their Python programs, identify bottlenecks, and if those are a good candidate for Cython, write just those parts in Cython and call them as functions. It's basically used by people who want the equivalent of a C extension (many Python libraries are actually written in C) but don't want to write it in C. You can gradually add code hints for the Cython compiler to help it generate better C code.
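
    The benchmark-first workflow described above can be shown with the standard library's cProfile; the hot/cold functions are invented for illustration:

```python
# Profile a program, then read the report to find the function worth
# rewriting in Cython/C. cProfile and pstats are in the standard library.
import cProfile
import io
import pstats

def hot(n):           # the bottleneck candidate
    return sum(i * i for i in range(n))

def cold(n):          # cheap, not worth optimising
    return n + 1

def program():
    for _ in range(50):
        hot(10_000)
    cold(3)

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("(hot)" in report)  # True: the profile names the hot function
```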

    So, while this may be an interesting research project, it's not introducing anything revolutionary.

    1. Anonymous Coward

      Re: Cython, Numba, etc.

      "In order for... C or C++ to be able to do everything... that Python can do"

      Umm, care to take a guess at what static language Python is written in? Everything Python can do, by definition, C can also do. And FWIW modern C++ can do it just as easily and with greater functionality.

      1. thames

        Re: Cython, Numba, etc.

        Yes, Python is written in C. And to do everything that Python can do, a C or C++ program would have to include the Python interpreter in it. C and C++ just don't have the language semantics to do everything that Python can do without duplicating the Python run time, which effectively embeds Python within the C program (embedding Python inside C or C++ programs is something that is done by the way).

        There have been numerous attempts to create static compilers for Python by translating the Python source code to C or C++ without relying on the Python run-time being present (Cython relies on it). In every case the authors get about 90 to 95% done before running into a brick wall and giving up. If it were possible, it would have been done by now.
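
        A few concrete examples of the dynamic semantics at issue: attributes and methods created at runtime, and code that exists only as a string. A static C or C++ translation of code like this has to carry machinery amounting to a Python runtime (the class and names below are invented for illustration):

```python
# Dynamic features that have no direct static-C equivalent.
class Account:
    pass

acct = Account()
acct.balance = 100              # attribute invented at runtime

def deposit(self, amount):
    self.balance += amount

Account.deposit = deposit       # method added after class creation
acct.deposit(50)                # balance is now 150

code = "acct.balance * 2"       # code that exists only as a string
print(eval(code))  # 300
```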

        1. Anonymous Coward

          Re: Cython, Numba, etc.

          So C can't do what Python can do, even though Python is written in C?


          Are you really this dumb, or are you just having a bad day? Either way, you clearly don't have a clue what you're talking about, and I suspect you're just parroting stuff you've read without understanding it. I suggest you educate yourself about how interpreters and compilers work.

          What gem will you come up with next? That assembler can't do everything either, and would need Python embedded in it in order to generate the assembler required to carry out Python functionality?

        2. Tim Parker

          Re: Cython, Numba, etc.

          I think you're getting confused about the difference between the functionality of a program written in a given language and an interpreter for that language, and also about the role of the embedded interpreter in a compiled Python program. You do not need an embedded Python interpreter, nor the Python runtime, in a program translated from Python to C/C++ (or another language) unless you are deliberately using some specific Python runtime feature, e.g. Idle.

  7. Anonymous Coward

    Does it error if it can't do it?

    Or does it, as neural nets have a habit of doing, just produce output regardless of whether it's correct? For example, if you had some C++ code with its own complex non-STL data structures involving raw pointers into memory-mapped files, shared memory or similar, you simply cannot do a line-for-line translation into Java or Python; it won't work. You have to understand the code and, for want of a better word, paraphrase it, or even do a ground-up rewrite if the language you're translating to doesn't support the functionality.

  8. RichardEM

    from Intel to ARM

    I wonder if this type of translation could help in Apple's move from Intel-based Macs to ARM-based Macs?

    1. Richard 12 Silver badge

      Re: from Intel to ARM

      Only if the set of goals does not include "actually works"

      This is an interesting experiment, and I'm sure much will be learned from it that can be applied to other problems.

      However, it will not be directly useful in and of itself within the next 20 years, if ever.

  9. Robert Grant Silver badge

    Time to test it

    on the Facebook codebase, then? PHP to Rust, perhaps?

  10. YetAnotherJoeBlow Bronze badge

    Even better

    Now for something really useful - Java to C. Ditch all the frameworks.

  11. Anonymous Coward

    I worked on a system that had been largely written in Pascal and converted to C with pascal2c. While the converted code compiled and ran, it was impossible to maintain in its C form. That left us tacking new functionality on around the edges, with little or no interaction with the core code. I'd like to know how this FB system, with 75% reliability that the conversion even works, is any advancement.

  12. John Smith 19 Gold badge

    "they are not guaranteed to be correct."


    Is that better than some s**t flinging code monkey cut and pasting off stack exchange?

    So let me see if I got this straight.

    It takes a function (which is presumed to be 100% working) and maybe converts it to one that has a 75% chance of working, but could be f**ked up in some way?

    One of the usual suspects (EDS perhaps?) was doing something like this, where chunks of COBOL were emailed to their mainframe and chunks of C came back. These are big applications. I'm very unconvinced a bunch of ANNs is going to cut it for this. Computer languages <> human languages: 99% of human-language ambiguity is designed out from day one to ensure they can be compiled.

  13. Spoonsinger

    I can just imagine the joys...

    of maintaining a set of source which has been generated from a totally different language. Glad it's not going to be me.

  14. John Smith 19 Gold badge

    Back in the day the sign of a good CASE tool was...

    If you could make all the manipulations you needed without digging into the code the system generated.

    I worked with 2 such systems.

    One never touched it. No problems.

    Other: some numpty had bought the code generated by the CASE model rather than the model itself.

    It was f**king horrible.
