Too big a queue and pipeline can slow down
If the queue is full of instructions already fetched from behind branches/conditionals, then a taken branch means that stuff gets discarded, and a long queue can end up slower than a short one.
ARM, designer of smartphone brains, will today reveal the Cortex-A35: a processor core subtly tweaked to run mobile web browsers and similar apps faster. How can a CPU be tuned for something seemingly so specific? The answer lies in the A35's instruction prefetch queue. ARM has halved the length of this queue in an attempt to …
Yes, browsers have many streams, but this plays such a pathetically small role in the browser process that it means nothing to optimize the streams as such. Things which would of course help in such a manner are hardware-accelerated bitstream handling for JPEG, GIF and MPEG decoding, though MPEG is generally already managed by hardware codecs. Even this, however, will still have little impact.
Browsers are a complicated thing. Most processing time in a browser is spent on indeterminate code, meaning that most of the time-sucking processes are related to items dictated by the page authors. The network can be a minor issue as well, due to packet fragmentation and the indeterminate packet size where TCP streams are involved. Most modern browsers use more or less push-oriented stream parsers that trigger reflows of the document structure as more tags are encountered. More, smaller frames cause more document adjustment, which causes more redrawing and general paint cycles. Larger frames hurt the user experience because they take longer to process and transfer.
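To put the frame-size point in concrete terms, here is a minimal sketch using Python's standard `html.parser`. It is not how any real browser engine works: the `ReflowCounter` class and `count_reflows` function are made-up names, and "one reflow per arriving chunk that contains at least one new tag" is an assumption chosen purely for illustration.

```python
from html.parser import HTMLParser


class ReflowCounter(HTMLParser):
    """Push parser that only counts the start tags it has seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags_seen = 0

    def handle_starttag(self, tag, attrs):
        self.tags_seen += 1


def count_reflows(document: str, chunk_size: int) -> int:
    """Feed the document in chunks; count chunks that produced at least one new tag.

    Assumption: each such chunk costs exactly one reflow.
    """
    parser = ReflowCounter()
    reflows = 0
    for start in range(0, len(document), chunk_size):
        before = parser.tags_seen
        parser.feed(document[start:start + chunk_size])
        if parser.tags_seen > before:
            reflows += 1
    parser.close()
    return reflows


doc = "<p>a</p>" * 20
small_frames = count_reflows(doc, 8)          # many small chunks
one_big_frame = count_reflows(doc, len(doc))  # everything in one chunk
```

Under that assumption the same page triggers more reflows when it arrives in small frames than in one large one, which is the trade-off described above: fewer reflows per page versus a longer wait before anything is drawn.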
There are hundreds of operations which a CPU designer could provide a browser developer that would cost little in transistor count that would make real performance improvements without playing stupid games like this. Want the #1 bread winners in my opinion?
1) MMU enhancements for decreasing GDT and LDT complexity as browsers are evil memory fragmentors to begin with. Operations to assist with page reallocation and defragmentation would do absolute wonders.
2) Faster byte oriented operations that decrease alignment oriented issues. Browsers almost always handle data one byte at a time.
3) Hardware Regex DFA engine. Implement a 256 cell 16-32 bit memory grid of direct address only non-DDR RAM for accelerated regular expression lookups and handling. Add a few extra instructions allowing this to be easily pushed to cache and loaded back. This will probably add major improvements.
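For anyone wondering what a table like that in point 3 would hold, here is a software sketch of a table-driven DFA. It assumes the "256 cell" grid means one row of 256 next-state entries per state, indexed by the input byte; that reading, the function names and the `[0-9]+` pattern are all illustrative assumptions, not a description of any real hardware.

```python
DEAD = 2  # sink state: once entered, the input can never match


def make_digits_dfa():
    """Build the transition table for the pattern [0-9]+ over raw bytes.

    States: 0 = start, 1 = seen at least one digit (accepting), 2 = dead.
    Each state owns a row of 256 cells, one per possible input byte.
    """
    table = [[DEAD] * 256 for _ in range(3)]
    for byte in range(ord("0"), ord("9") + 1):
        table[0][byte] = 1
        table[1][byte] = 1
    return table, {1}


def dfa_match(table, accepting, data: bytes) -> bool:
    """Run the DFA: one table lookup per input byte, no backtracking."""
    state = 0
    for byte in data:
        state = table[state][byte]
    return state in accepting


table, accepting = make_digits_dfa()
is_number = dfa_match(table, accepting, b"2015")
```

The inner loop is a single indexed load per byte, which is the part a dedicated lookup memory would be meant to speed up.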
I could think of a few dozen more... But they'd be useless unless someone downloaded Chromium and actually implemented the code to handle this. Of course, modeling the instructions in SystemC and integrating them into a browser to identify their benefits would be smart too.
Methinks you're overinterpreting the changes and expected performance improvements.
It looks like the simplification of the architecture leads to a 20% improvement over the previous generation. The real boost comes from, surprise, surprise, boosting the clock speed. This might be enough in the cut-throat section of the market.
As for your suggestions: who's saying they aren't available (for a price) from ARM? Or haven't been added by some makers to their own chips? I think this is the difference between the ARM and Intel value proposition.
Claiming it can't work because browser code is branch heavy? To me that suggests frequent queue flushes and more memory bandwidth wasted on abandoned op fetches. Reducing the waste with a shorter prefetch seems worth trying, and I doubt they bothered licensing till benchmarking showed a real improvement.
However counterintuitive it seems to you, trying to second guess the complex interactions at this level is more magic than science and you'll always be surprised, however long you do it.
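A back-of-envelope version of the waste argument, for what it's worth. This is a toy model, not the A35's actual behaviour: it assumes no branch prediction, that the queue is always full when a taken branch resolves, and that everything in the queue is then discarded. The function name and parameters are made up for the sketch.

```python
def wasted_fetch_fraction(queue_depth: int, taken_branch_rate: float) -> float:
    """Fraction of all fetched instructions that are thrown away.

    Assumptions: every taken branch flushes a completely full queue,
    and taken_branch_rate is taken branches per executed instruction.
    """
    # Average number of discarded fetches per executed instruction.
    wasted = taken_branch_rate * queue_depth
    # Total fetches per executed instruction = 1 useful + wasted.
    return wasted / (1.0 + wasted)


long_queue = wasted_fetch_fraction(queue_depth=8, taken_branch_rate=0.1)
short_queue = wasted_fetch_fraction(queue_depth=4, taken_branch_rate=0.1)
```

In this model halving the queue always cuts the wasted fraction, but it says nothing about the stalls a shorter queue can cause, which is exactly why benchmarking beats guessing.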
Darn, you got there first.
All the other "whizz bang" stuff in the article and the bit that really stuck out for me was:
A single 64-bit A35 core with an 8KB L1 cache, the most barebones configuration, takes up less than 0.4mm² on a silicon die using a 28nm process
Amazing. Trying to visualise 0.4mm². I suppose it's about a quarter the size of the full stop printed on my keyboard.
Then again, didn't the original ARM chip come in at something like 25,000 transistors, only 6 or 7 times the number in the 6502, and about a fifth the number in the contemporary 80286? I have no idea how many transistors are in the A35, but I'm sure this economy of design has continued and probably goes a long way to explaining ARM's frugality with power.
From the ARM website
The smallest configuration of the Cortex-A35 processor using 8K L1 caches occupies less than 0.4 mm² and consumes less than 6mW at 100 MHz in 28nm
Even if it is 0.6mm on a side, that's still pretty titchy.
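For the curious, the arithmetic behind that, as a quick sketch. It assumes a square die area, and since the quoted figures are "less than" values, the results are upper bounds. The helper names are made up.

```python
import math


def side_of_square_mm(area_mm2: float) -> float:
    """Side length of a square with the given area (assumes a square layout)."""
    return math.sqrt(area_mm2)


def microwatts_per_mhz(power_mw: float, clock_mhz: float) -> float:
    """Power per MHz of clock, converted from milliwatts to microwatts."""
    return power_mw * 1000.0 / clock_mhz


side = side_of_square_mm(0.4)            # a little over 0.63 mm
efficiency = microwatts_per_mhz(6, 100)  # 60 microwatts per MHz
```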
Things are very small in the ARM world:
In its smallest possible configuration with 4KB caches, the Cortex-A5 can be just 0.2mm² in size at 28nm process
In a 28nm process, the Cortex-A7 can run at 1.2-1.6GHz, has an area of 0.45mm² (with Floating-Point Unit, NEON and a 32KB L1 cache) and requires less than 100mW of total power in typical conditions
and so on...
Oh, and the ARM website is very nice to navigate. Very fast.
I've been using cheapo A7-powered quad- and octa-core Android phones for two years now, and for most daily usage - web browsing, emails, chats and calls - they're perfectly sufficient. A more efficient yet powerful successor would be icing on the cake. A power-sipping 4+4 A35 design would be the best for web browsing, with 4 low-speed cores handling scrolling and background work and 4 faster cores doing initial rendering.
Biting the hand that feeds IT © 1998–2020