ARM's new Cortex-A35: How to fine-tune a CPU for web browsing on bargain smartphones

ARM, designer of smartphone brains, will today reveal the Cortex-A35: a processor core subtly tweaked to run mobile web browsers and similar apps faster. How can a CPU be tuned for something seemingly so specific? The answer lies in the A35's instruction prefetch queue. ARM has halved the length of this queue in an attempt to …

  1. Mage Silver badge

    Too big a queue and pipeline can slow down

    If the program has a lot of branches/conditionals, then already-fetched stuff gets discarded, and a long queue can end up slower than a short one.
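
    The effect is easy to sketch. In this toy model (the function name and numbers are mine, not ARM's), the fetch unit keeps a queue of `depth` instructions topped up, and every taken branch throws the queued instructions away, so the fetched-but-discarded work grows in direct proportion to queue depth:

```python
# Toy model: count instruction fetches discarded by prefetch-queue flushes.
# After each executed instruction the fetch unit tops the queue back up to
# `depth`; a taken branch discards whatever is queued.
def wasted_fetches(depth, n_insns, branch_every):
    wasted = 0
    for pc in range(1, n_insns + 1):
        queued = depth                 # fetch unit keeps the queue full
        if pc % branch_every == 0:     # taken branch: flush the queue
            wasted += queued
    return wasted

# On branchy code (a taken branch every 5 instructions), halving the
# queue halves the discarded fetch work:
# wasted_fetches(8, 1000, 5) -> 1600
# wasted_fetches(4, 1000, 5) -> 800
```

    On a phone-class core those discarded fetches are memory bandwidth and energy spent for nothing, which is presumably what ARM is trying to claw back.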

    1. A Non e-mouse Silver badge

      Re: Too big a queue and pipeline can slow down

      Wasn't that the problem with the Pentium II?

  2. Charles Manning

    Good thing ARM is a fabless company

    Brits are no good at shovelling electrons.

    Just look at the British motor vehicle industry: destroyed by Lucas electrics.

    1. smartypants

      That destroyed motor vehicle industry in full...

      According to the SMMT, by 2020, it's forecast to build a record 2 million cars a year (currently 1.5 million)*


    2. Anonymous Coward
      Anonymous Coward

      Re: Good thing ARM is a fabless company

      >Brits are no good at shovelling electrons.

      "Shovelling electrons", you say?

      If that's an indication of your technical knowledge in the area of solid-state microelectronics...

    3. Anonymous Coward
      Anonymous Coward

      Re: Good thing ARM is a fabless company

      "Brits are no good at shovelling electrons."

      First radar, television, digital computer, particle accelerator... guess whom & where.

  3. CheesyTheClown

    Ok... How exactly will this work?

    This sounds a lot like the silliness Symbian passed off back in the day, which showed a clear failure to understand web browsers. If you want to improve web browser performance, a prefetch cache or pipeline like this is useless for more reasons than I can count. I can tell you definitively, as a web browser developer with years of experience optimizing for embedded platforms, often ARM-based, that this approach is senseless. If anything, it is actually damaging given emerging JavaScript technologies.

    Yes, browsers have many streams, but streams play such a pathetically small role in the browser process that optimizing for them means nothing. Things that would help in this manner include hardware-accelerated bit-stream handling for JPEG, GIF and MPEG decoding, though MPEG is generally handled by hardware codecs already, and even this would have little impact.

    Browsers are complicated things. Most processing time in a browser goes to indeterminate code, meaning the biggest time-sucking processes are dictated by the page authors. The network can be a minor issue as well, due to packet fragmentation and indeterminate packet sizes where TCP streams are involved. Most modern browsers use push-oriented stream parsers that trigger reflows of the document structure as more tags are encountered. More, smaller frames cause more document adjustment, which causes more redrawing and general paint cycles; larger frames decrease the user experience because they take longer to process and transfer.
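
    The reflow point can be sketched with a toy incremental parser (hypothetical, not taken from any real browser engine): any network chunk that completes at least one tag triggers a relayout pass, so the same document arriving in smaller chunks produces more reflows:

```python
# Toy streaming parser: a chunk that completes at least one tag ('>')
# triggers a relayout pass, so smaller chunks mean more reflows.
def reflow_count(doc, chunk_size):
    reflows = 0
    for start in range(0, len(doc), chunk_size):
        if '>' in doc[start:start + chunk_size]:
            reflows += 1
    return reflows

# The same 19-byte document, delivered in 4-byte chunks versus one
# chunk, triggers 3 relayout passes instead of 1:
# reflow_count("<p>hi</p><p>bye</p>", 4)  -> 3
# reflow_count("<p>hi</p><p>bye</p>", 19) -> 1
```

    Real engines batch and coalesce reflows, of course; the sketch only illustrates why packet size interacts with layout work at all.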

    Let's also consider the current state of JavaScript. Code that is downloaded and compiled will suffer badly from long pipelines, as the compilation process has no predictability and branches very heavily; it would be nearly impossible to create long chains of instructions without branches. Very often JavaScript code must be re-JITed, because JavaScript is such an evil language. Add the other factor: in most cases tracing JITs never worked, since they required so much time simply calculating metrics for the traces that longer pipelines failed to increase performance enough to compensate.

    Let's also add that tweaking memory access on a relatively obscure platform means lock-less inter-process/thread JavaScript shared memory won't happen, since developers are already stretched implementing it for the more mainstream cores. A longer pipeline would increase the complexity of such operations so greatly that, in a multi-core environment, JIT developers will be forced to lock.

    There are hundreds of operations a CPU designer could provide a browser developer that would cost little in transistor count and deliver real performance improvements, without playing stupid games like this. Want my #1 breadwinners?

    1) MMU enhancements to decrease GDT and LDT complexity, as browsers are evil memory fragmenters to begin with. Operations to assist with page reallocation and defragmentation would do absolute wonders.

    2) Faster byte-oriented operations that decrease alignment issues. Browsers almost always handle data one byte at a time.

    3) Hardware regex DFA engine. Implement a 256-cell, 16-32 bit memory grid of directly addressed (non-DDR) RAM for accelerated regular expression lookups and handling. Add a few extra instructions allowing this to be easily pushed to cache and loaded back. This would probably yield major improvements.
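
    For illustration, here is a software sketch of the table-driven lookup such a hardware grid would accelerate: a hypothetical DFA for the regex ab+c, with one 256-entry transition row per state, so each input byte costs a single table access regardless of the pattern (state names and helper functions are mine):

```python
# Table-driven DFA for the regex ab+c: one 256-entry transition row per
# state, mirroring the 256-cell grid proposed above. Matching is a single
# table lookup per input byte, with no branching on the data itself.
REJECT, START, SEEN_A, SEEN_B, ACCEPT = range(5)

def build_dfa():
    delta = [[REJECT] * 256 for _ in range(5)]
    delta[START][ord('a')] = SEEN_A   # 'a' starts a match
    delta[SEEN_A][ord('b')] = SEEN_B  # at least one 'b' required
    delta[SEEN_B][ord('b')] = SEEN_B  # further 'b's loop here
    delta[SEEN_B][ord('c')] = ACCEPT  # 'c' completes the match
    return delta

def dfa_match(delta, text):
    state = START
    for byte in text.encode():
        state = delta[state][byte]    # one lookup per byte
    return state == ACCEPT
```

    A hardware version would hold `delta` in the dedicated RAM grid and step it once per byte, which is exactly the access pattern the extra cache push/load instructions would feed.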

    I could think of a few dozen more... but they'd be useless unless someone downloaded Chromium and actually implemented the code to use them. Of course, modeling the instructions in SystemC and integrating them in a browser to identify their benefits would be smart too.

    1. Charlie Clark Silver badge

      Re: Ok... How exactly will this work?

      Methinks you're overinterpreting the changes and expected performance improvements.

      It looks like the simplification of the architecture leads to a 20% improvement over the previous generation. The real boost comes from, surprise, surprise, raising the clock speed. This might be enough in the cut-throat segment of the market.

      As for your suggestions: who's saying they aren't available (for a price) from ARM? Or haven't been added by some makers to their own chips? I think this is the difference between the ARM and Intel value proposition.

    2. Paul Shirley

      Re: Ok... How exactly will this work?

      Claiming it can't work because browser code is branch heavy? To me that suggests frequent queue flushes and more memory bandwidth wasted on abandoned op fetches. Reducing the waste with a shorter prefetch queue seems worth trying, and I doubt anyone would have bothered licensing it until benchmarking showed a real improvement.

      However counterintuitive it seems to you, trying to second guess the complex interactions at this level is more magic than science and you'll always be surprised, however long you do it.

  4. Anonymous Coward
    Anonymous Coward

    0.4 sq mm

    That's a 64-bit CPU. When I started playing with electronics, a single transistor was a lot bigger than that.

    It's almost as bad as trying to get my head around Avogadro's number in O-level chemistry.

    1. Martin an gof Silver badge

      Re: 0.4 sq mm

      Darn, you got there first.

      All the other "whizz bang" stuff in the article and the bit that really stuck out for me was:

      A single 64-bit A35 core with an 8KB L1 cache, the most barebones configuration, takes up less than 0.4mm² on a silicon die using a 28nm process

      Amazing. Trying to visualise 0.4mm²... I suppose it's about a quarter the size of the full stop printed on my keyboard.

      Then again, didn't the original ARM chip come in at something like 25,000 transistors, only 6 or 7 times the number in the 6502, and about a fifth the number in the contemporary 80286? I have no idea how many transistors are in the A35, but I'm sure this economy of design has continued and probably goes a long way to explaining ARM's frugality with power.


      1. Anonymous Coward
        Anonymous Coward

        Re: 0.4 sq mm

        I am hoping it is actually 0.4 sq mm, i.e. about 0.6mm on a side. Because if it is 0.4mm on a side as the article suggests, that is two and a half times smaller, and my flabber is even more ghasted.

        1. Martin an gof Silver badge

          Re: 0.4 sq mm

          From the ARM website

          The smallest configuration of the Cortex-A35 processor using 8K L1 caches occupies less than 0.4mm² and consumes less than 6mW at 100MHz in 28nm

          Even if it is 0.6mm on a side, that's still pretty titchy.

          Things are very small in the ARM world:

          In its smallest possible configuration with 4KB caches, the Cortex-A5 can be just 0.2mm² in size at a 28nm process

          In a 28nm process, the Cortex-A7 can run at 1.2-1.6GHz, has an area of 0.45mm² (with Floating-Point Unit, NEON and a 32KB L1 cache) and requires less than 100mW of total power in typical conditions

          and so on...

          Oh, and the ARM website is very nice to navigate. Very fast.


  5. rainjay

    I've been using cheapo A7-powered quad- and octa-core Android phones for two years now, and for most daily usage - web browsing, emails, chats and calls - they're perfectly sufficient. A more efficient yet powerful successor would be icing on the cake. A power-sipping 4+4 A35 design would be best for web browsing, with four low-speed cores handling scrolling and background work and four faster cores doing the initial rendering.
