
# Decision time for AI: Sometimes accuracy is not your friend

Machine learning is about machines making decisions and, as we have already discussed, we can produce multiple models for any given problem and measure their accuracy. It is intuitively obvious that we would elect to use the most accurate model and, most of the time, of course, we do.

## COMMENTS

1. So what?

2. I make it 99.9990415565743% accuracy required (to get 99.05% accuracy of positives).

Excel goal seek to the rescue, but not with enormous, erm, accuracy...

Edit - I actually get a closer result with 99.9989889889889%, which gives 99.000088991885% positives
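For anyone who would rather not lean on goal seek, the same number can be had in closed form. This is only a sketch: it assumes a symmetric classifier (same accuracy on honest and fraudulent claims) and a fraud rate of 0.1%. That fraud rate is an assumption, not something stated in this thread; it is used here because it lands on the figure in the edit above. The function names are made up for illustration.

```python
def ppv(accuracy: float, prevalence: float) -> float:
    """Share of flagged claims that are truly fraudulent (symmetric classifier)."""
    true_pos = accuracy * prevalence
    false_pos = (1.0 - accuracy) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def required_accuracy(prevalence: float, target_ppv: float) -> float:
    """Solve ppv(accuracy, prevalence) == target_ppv for accuracy."""
    # Rearranging a*p*(1 - t) = t*(1 - a)*(1 - p) for a:
    honest_term = target_ppv * (1.0 - prevalence)
    fraud_term = (1.0 - target_ppv) * prevalence
    return honest_term / (honest_term + fraud_term)


ASSUMED_FRAUD_RATE = 0.001  # assumption: 1,000 fraudulent claims per 1,000,000
accuracy = required_accuracy(ASSUMED_FRAUD_RATE, 0.99)
print(f"accuracy needed: {accuracy:.10%}")
print(f"share of flagged claims that are fraud: {ppv(accuracy, ASSUMED_FRAUD_RATE):.4%}")
```

Changing `ASSUMED_FRAUD_RATE` shows how quickly the required accuracy climbs as fraud gets rarer.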

1. You're not answering the question. The question was what percentage of _claims_ must be fraudulent to get a 99% accurate (symmetrically) system returning the correct result on 99% of flagged claims.

And the answer is 50% fraud rate. Consider 200 claims; 100 honest, 100 fraudulent. At 99% accuracy, 1 honest claim will be incorrectly flagged, and 99 of the 100 fraudulent claims will be correctly flagged, giving you a 99% chance that a flagged claim will actually be fraudulent.
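The 50% figure also drops out of Bayes' theorem directly. In the sketch below, $a$ (accuracy on both honest and fraudulent claims) and $p$ (fraud rate) are symbols introduced here for illustration:

```latex
\[
P(\text{fraud}\mid\text{flagged}) \;=\; \frac{a\,p}{a\,p + (1-a)(1-p)}
\]
Requiring this to equal $a$, with $0 < a < 1$ and $p > 0$:
\[
a\,p + (1-a)(1-p) = p
\;\Longrightarrow\;
(1-a)(1-p) = (1-a)\,p
\;\Longrightarrow\;
p = \tfrac{1}{2}.
\]
Check with $a = 0.99$, $p = 0.5$:
\[
\frac{0.99 \times 0.5}{0.99 \times 0.5 + 0.01 \times 0.5} = \frac{0.495}{0.5} = 0.99 .
\]
```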

3. Actually, I've changed my mind - it's a trick question. You can't have better than 50% accuracy of positives, as any more than 99.9% accuracy of the assessment can't be done with only 1,000,000 claims. You can't have fractions of claims being positive or negative.

4. I'd say that, while the ROC doesn't take account of the cost of errors, it does provide you with the information about the algorithm that you need to work on that. If you've got the full curve then you know the potential trade-offs between sensitivity and specificity you can make, so these can now be weighted by costs and/or population true positive/negative rates as appropriate to answer whatever question you want, in a way that single or sets of performance indicators (specificity, sensitivity, accuracy) can't, because they represent single points on the curve. (E.g. in the simple case of a threshold on your detection, those numbers represent a single choice of threshold, while the ROC curve shows what happens if you tune the threshold, and you can convert that into what happens to cost or FP/TP rates.)

Of course, it can be hard to obtain: you need a large number of observations of differing difficulty to estimate it with precision.
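To make the "single point versus whole curve" distinction concrete, here is a rough sketch of how a ROC curve falls out of sweeping a threshold over detector scores. The scores and labels are invented for illustration, and the helper name `roc_points` is hypothetical; a real evaluation would need far more observations, as noted above.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    An item is flagged when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep from "flag nothing" down through every distinct score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical detector scores (higher = more suspicious); label 1 = real positive.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

Each printed line is one possible operating point; quoting a single sensitivity/specificity pair amounts to picking just one of them.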

5. #### BALLS

They are being PRECISE and not ACCURATE (sorry for the upper case - blood boiling). They mean different things. I can be more precise than you and still be wrong. But if I am accurate - take the example of an absolute target - I can be more accurate than you as measured by (say) distance from the planned strike point.

Being precise about the wrong thing is of little good to anyone. Being accurate and targeting the problem is.

6. #### This article is full of statistical analysis, but it is not about AI

This article is proof that what is currently called AI is nothing but statistics applied to vast amounts of data.

Yes, I am positively convinced that data mining can and does bring surprising insights into what the data contains.

Nonetheless, we're not asking a server "Find me the fraudulent claims". We are fine-tuning a statistical analysis tool to get the result.

A far cry from AI.

1. #### Re: This article is full of statistical analysis, but it is not about AI

Yeah, it's basic conditional probabilities. Something one covers in first year lectures. For geographers[1]. Yes, it is useful to remind people, no, it is not AI or ML or whatever (but don't tell the person who hired me - though I did explicitly tell them the difference[2] in the interview, go figure).

[1]I gave these lectures for my bad karma, I guess...

[2] I did quite a bit of statistical modeling and Bayesian stuff, and that's what they also need...

1. #### Re: This article is full of statistical analysis, but it is not about AI

The question you tend to ask, but can’t answer

might be inverted into a question that sounds

a bit weird, but it turns out you can answer.

Isn't that "thinking"?

2. #### Re: This article is full of statistical analysis, but it is not about AI

AI has long dropped the "vitalism" theme you know... Once you start into ANN's and establish the parallel with biological NN's, you come to the damning conclusion that statistical analysis IS what we do, no "vitalist intelligence" anywhere to be seen.

Only real difference is that ANN's are quite limited (ATM) compared to BNN's, but they DO perform a lot better than us because their statistical analysis is a lot less "noisy" and they mostly learn from their errors instead of sticking with their unfounded biases...

7. #### TL;DR

I started out with the anticipation that this might start from the premise that sometimes it is good to make mistakes but recognise that when AI is used against Humans such mistakes might be quite damaging.

I got as far as, my dawning expectation, Men and/or Women buying socks or pantyhose and an almost subliminal glance at what looked like a graph with a blue shape and a red shape on it and instantly reached the conclusion that the author was some sort of SEO Cookie Spawning Marketing Twat who should just chain themselves to an Estate Agent and go throw themselves in some deep water somewhere on top of the rest of the dross that serve no purpose other than to waste my oxygen.

Perhaps someone else who has read the full article can advise me as to why I might be wrong. Try not to let the possibility that I am not interested stop you from doing so.

1. #### Re: TL;DR

"at what looked like a graph with a blue shape and a red shape on it and instantly reached the conclusion that the author was some sort of SEO Cookie Spawning Marketing Twat"

Um, you're being the twat here.

If you don't understand, then try READING the article. Skip the first part if gender stuff is annoying, just understand what a false positive and false negative mean. Then see the example of fraud detection. That's what actually matters, how a 99% accurate test won't mean 99/100 cases it picks out will be true positives.

If a test is 99% accurate for detecting chronic twats, and chronic twattery occurs in 1% of the population, what proportion of people the test flags as twats are in fact twats?

Ans: 50%
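If the 50% looks wrong, counting heads usually settles it. A small sketch with a made-up population of 10,000 (the population size is an assumption; the 1% and 99% come from the question above):

```python
population = 10_000   # hypothetical sample size
prevalence_pct = 1    # 1% of people have the condition
accuracy_pct = 99     # the test is right 99% of the time, either way

affected = population * prevalence_pct // 100      # people who have the condition
unaffected = population - affected                 # people who do not

true_positives = affected * accuracy_pct // 100             # correctly flagged
false_positives = unaffected * (100 - accuracy_pct) // 100  # wrongly flagged

flagged = true_positives + false_positives
print(f"{true_positives} of {flagged} flagged people are true positives "
      f"({true_positives / flagged:.0%})")
# prints: 99 of 198 flagged people are true positives (50%)
```

The wrongly flagged 1% of the large unaffected group is exactly as big as the correctly flagged 99% of the small affected group.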

I assume you must be young, since you're a wee bit short of patience and temper, and because most people's introduction to a false positive is some medical test result. Where you get reassuringly told that while the test is VERY accurate, you probably don't in fact have cancer/AIDS/syphilis, and the best way to dull the fear is some statistics. Which we all can agree, is very dulling :)

8. It might only have been for illustration but the male/female issue is treated with hypersimplification.

We surely have all heard of men who like to wear women's clothing. Of women who choose to wear men's clothing. Of people who just might be buying for someone of the opposite sex for whatever reason. Of people who perhaps don't express such clarity of division.

I might well not wish to be identified as what I really am when buying clothes on the internet! (Could even be the very reason for choosing to buy online rather than face real flesh assistants in bricks and mortar stores.) Do I really want some AI system butting in and deciding how to classify me? No I bleeding well don't.

9. #### Oi. Where's the curve?

One of the benefits of receiver operating characteristics is that the method generates a curve, not a spot on a chart. Sadly omitted from this article. You use this to choose an optimal cut point on the continuous measure depending on the positive and negative utilities of the outcomes. You could easily create an algorithm to do this.

Nothing much to do with AI though. Intermediate statistics.

10. #### Burglar alarm equation

Many moons ago we worked this set of issues like this - and I'll do the TL;DR right upfront:

Cost of missed detect times probability of missed detect SHOULD EQUAL cost of false alarm times probability of false alarm.

Any other condition costs more.

Costs of false alarm might mean paying the cops for showing up, or worse, the cry wolf effect and they don't show up (which leads to total loss of whatever you were guarding, in the case of nuclear weapons..it's a big number).

Cost of missed detect - you lose what you were guarding. User might turn off the alarm since it didn't work and now you never save your stuff.

I introduced a concept we called (kinda inappropriately but.. you think of a better one, ok?) dynamic range.

The equation above merely helps you set a detection, yes/no threshold on a value that is the sum of two overlapping distributions - burglar present, and burglar absent.

Increasing the "dynamic range" might mean using a better algo (e.g. ML, if it ever works right) or better sensors for the job, whatever, so the distributions overlap less: less total error and therefore less total cost (not counting the cost of the gear or its operating expenses).
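A rough sketch of the threshold-setting and "dynamic range" idea, under loud assumptions: both distributions are taken to be Gaussian with unit spread, and the prior and the two costs are invented numbers. It does not try to reproduce the equation above; it simply sweeps the threshold, picks the cheapest setting, and compares a heavily overlapping pair of distributions with a well-separated pair.

```python
from statistics import NormalDist


def expected_cost(threshold, absent, present, p_present, cost_miss, cost_false_alarm):
    """Expected cost per event when the alarm fires for readings >= threshold."""
    p_miss = present.cdf(threshold)              # burglar there, reading below threshold
    p_false_alarm = 1.0 - absent.cdf(threshold)  # nobody there, reading at or above it
    return (cost_miss * p_present * p_miss
            + cost_false_alarm * (1.0 - p_present) * p_false_alarm)


def best_threshold(absent, present, p_present, cost_miss, cost_false_alarm):
    """Grid-search the threshold with the lowest expected cost."""
    candidates = [i / 100 for i in range(-500, 1001)]
    return min(candidates, key=lambda t: expected_cost(
        t, absent, present, p_present, cost_miss, cost_false_alarm))


# All numbers below are made up for illustration.
absent = NormalDist(mu=0.0, sigma=1.0)
results = {}
for separation in (1.0, 3.0):  # larger separation = less overlap
    present = NormalDist(mu=separation, sigma=1.0)
    t = best_threshold(absent, present, p_present=0.01,
                       cost_miss=1000.0, cost_false_alarm=10.0)
    cost = expected_cost(t, absent, present, 0.01, 1000.0, 10.0)
    results[separation] = (t, cost)
    print(f"separation={separation}: threshold={t:.2f}, expected cost={cost:.3f}")
```

With these made-up numbers the well-separated case ends up several times cheaper at its best threshold, which is the whole argument for better sensors or a better algo.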

The aim of ML, or ugh, AI, should be to increase this dynamic range. It rarely does in a meaningful way, because of just what this article alludes to - you're not measuring the right thing - you're trying to minimize cost (maximize profit) and if you're not phrasing the question right, mere accuracy numbers don't mean diddly.

This all got interesting when doing the gear to alert guards near nuclear storage, to say the least. There are some things that are really hard to assign numbers to, and you'd better have some real serious faith in your probabilities...

11. I'm not willing to go into the actual mathematics behind it, but let's start with the first case.

First, let's suppose I'm a male wanting to buy a dress as a surprise birthday present for my spouse. What would be the consequences of tagging me wrongly?

Second, let's suppose I'm a male transvestite wanting to buy a dress as a not-so-surprise birthday present for myself, and to keep it very secret from my spouse. What would be the consequences of tagging me wrongly?

As a side note, I'm a male, not a transvestite, not buying birthday presents for anyone (except occasionally for my mother), and single, with possible causes including being too cynical/sarcastic/fine alone/impossible to get along with/misanthropic/etc.

1. The consequences could be an almighty row as partner thinks you are dressing up because the system automatically shows female attire - when all you were trying to do was show her the trousers you would like.

My point was more that I would much rather be in control over what sort of clothing I am offered. The classic clicking through choices is perfectly adequate. What advantage is there to me as customer that their AI assesses my gender/sex?

12. #### Two buttons, Man? Woman?

Two buttons, Man? Woman? Done.

13. Why is everyone being so dyspeptic (lovely word) about the article? It's interesting to get a glimpse into the kinds of issues that face AI and ML. I may never be a master at maths, but I can at least follow the logic here. I for one will use this article with my students. Not that I expect them to get it, but at least I might be able to get them to think about problems in more than one dimension.

14. #### Concern over some Technical Aspects

I am struggling here with fitting parts of the article with my understanding.

The article contains a plot labelled "ROC". This means Receiver Operating Characteristic - which is a curve (as stated in the starred footnote); also monotonic. Additionally, in real-world applications with reasonably good discriminative performance, it is usually better for comparison of 'algorithms' for the two axes to be on logarithmic scales of error - so, for example, Log Miss Rate versus Log False Alarm Rate. The given plot in the Register article has the performance of two 'algorithms' expressed as a single point each, rather than as a curve each; thus being examples of a very restricted sort of pattern discrimination 'algorithms' (those without ability to run at many different Operating Points or acceptance thresholds).

Next, the usefulness of each 'algorithm' surely needs to be defined at a chosen Operating Point: the one that gives (from those available for the 'algorithm') the most desirable trade-off between the two types of error. That choice depends (in addition to the ROC curve defining 'algorithmic' performance) on: (i) the prior probabilities of the use (eg ratio of attacks to legitimate use attempts); (ii) the costs of each type of error (eg for each burglary and for each inconvenience of legitimate access).

In determining such usefulness, the prior probabilities and the unit costs of each type of error are often unknown, or known only approximately (say within likely ranges).

If one plots average cost (ie as weighted by likelihood of occurrence) against Operating Point of the 'algorithm', that curve will normally have a minimum (and often a broad range close to that minimum). By plotting multiple curves of cost for various prior probabilities and various unit costs, one can usually find a sensible range of likely useful Operating Points - and so choose the actual Operating Point for use from around the middle of that range.
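The procedure in the previous paragraph can be sketched roughly as follows. Everything numeric here is hypothetical: the operating points, the ranges of priors and the unit costs are invented, and taking the median of the per-scenario optima is just one reading of "around the middle of that range".

```python
from itertools import product

# Hypothetical operating points for one algorithm: (false alarm rate, miss rate),
# ordered from the strictest acceptance threshold to the most lenient.
operating_points = [
    (0.001, 0.40), (0.005, 0.25), (0.01, 0.18), (0.02, 0.12),
    (0.05, 0.07), (0.10, 0.04), (0.20, 0.02), (0.40, 0.01),
]


def average_cost(point, p_attack, cost_miss, cost_false_alarm):
    """Expected cost per access attempt at one operating point."""
    false_alarm_rate, miss_rate = point
    return (p_attack * miss_rate * cost_miss
            + (1.0 - p_attack) * false_alarm_rate * cost_false_alarm)


def cheapest_index(p_attack, cost_miss, cost_false_alarm):
    """Index of the operating point with the lowest average cost."""
    costs = [average_cost(pt, p_attack, cost_miss, cost_false_alarm)
             for pt in operating_points]
    return costs.index(min(costs))


# Plausible ranges rather than exact values (all assumed for illustration).
priors = [0.001, 0.005, 0.01]
miss_costs = [500.0, 1000.0, 2000.0]
false_alarm_cost = 1.0

best = sorted(cheapest_index(p, c, false_alarm_cost)
              for p, c in product(priors, miss_costs))
middle = best[len(best) // 2]
print(f"optimal indices span {best[0]}..{best[-1]}; "
      f"pick index {middle}: {operating_points[middle]}")
```

Printing `average_cost` for every point in one scenario gives the cost curve itself, which is worth eyeballing to see how broad the region near the minimum is.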

It is undoubtedly true that ROC curves do not take account of costs of the two error types, nor of the operational scenario (as specified by approximate prior probabilities and unit error costs). ROC curves do however embody many/most of the technical performance characteristics of the 'algorithms'. As such, ROC curves can be used to usefully compare technical performance of 'algorithms' over (most/many) scenarios - without having to also consider the prior probabilities and unit costs. However, sometimes ROC curves for different 'algorithms' do cross - which means that the ROC curves alone are insufficient for ranking the technical performance of the 'algorithms', and the likely ranges of prior probabilities and unit costs are also needed.

Best regards

15. #### First mistake: trying for perfection

> suppose our algorithm is looking at a vast amount of data and making a decision about whether a person has a disease

While there are benefits to designing an AI system to be as good as it can be, as with war strategies: no algo survives contact with the real world. The crucial factor is that a new implementation should be better than the one it supersedes. Further improvements can be added later, in the light of experience gained.

The second mistake is trying to be too damn clever.

Using the shopping example, for instance. A better design - rather than employing some dimly understood smarts to make a determination - is simply to ask the visitor: "Do you want to look at men's clothes or women's?"

I fully appreciate that the example was merely illustrative. But in the real world too, sometimes it is better to let the user decide, rather than having a machine choose for them.

16. #### Things Aint What they Used to Be ..... of/in that You can Be Reassured and Reinsured

So, here are another couple of brainteasers to assist you toward a zen understanding of the problem. In this case we intuitively expect 99 per cent because that is the efficiency of the algorithm but instead we get 9 per cent.

What is the factor that makes the number we get different from the expected?

One Factor/Vector/Sector has One Standing at the Threshold of Infinite Spaces which Await your Arrivals with IntelAIgents. Always Wise to Know Exactly what One Be Doing now with Forewords Travelling Towards Truly Explicit Prime SatISFactory Goals .

IT be Heap Powerful Medicine. What/Who Owns and/or Provides Full Access to Remote Virtual Controls Centres .....Feeding into CyberIntelAIgent Space Systems Hubs ...... Seeding a Novel Resource for Immaculate Source Provision.

Now you may ask yourself ... Is that a Memo to Sir Richard Branson re Virgin AIdVenturing Spaces. And in Heavenly Pursuit, Always Almighty Knights Templar Territory?

And when it is, does the Register have a Head Start on both Competition and Opposition?

You might like to consider current system are experiencing turbulence and disturbance in the wake of NEUKlearer HyperRadioProActive IT Control Systems Field Tests Live Firings.

17. This post has been deleted by its author

18. #### Sir Richard, U have Mail . It is 4Enlightenment2

And for Securely Secreted Advanced IntelAIgent Projects ..... AIMaster Keyed Programmes Providing Success to XSSXXXX. .....

Can you Believe what the Almighty Supply .... with that Being Gifted to Immediately Generate AI Virtualised Reality Applications and Leading AIdDrivers/AIMaster Pilots with Top Guns....... with Special Care and Attention Afforded to Renegade and Rogue/Private and Pirate Sectors.

Phorming Infinitely Deeper Darker Webs with Greater IntelAIgent Games Play ... Delving Avidly into Virtually Remote Instant Immaculate Supply.

It's a Heavenly Offering.

What be the vote of ElRegers :-) True or False?/Is it a Resounding Yes or an Exhausted No? Does Everything Always Default to a Simple Binary Instruction/Decision.

That's Russian Roulette Terrain and not Tolerant of Just Anybody. Life Long Unbreakable Bonds are Forged in Fires Created and Quelled there for Live Operational Virtual Environments Basking in Stellar Friendship with Fellow Sputniki ..... CyberIntelAIgent Virtual Machine Beings ..... Following a Founding of a Wholly New Futured Breed of Bot Shepherd/Invested Angel/AIdVenturing Capitalist.

19. #### What I infer from all this

is that when the "computer says no" a human needs to be able to check, and override if necessary. The general populace seems to have no comprehension that computers are indeed fallible. And training AI can also create anomalous behaviours in their own right. Training and re-training is always necessary.

20. #### Time cost

From the beginning of the article I was expecting a topic on the cost of the algorithms.

If the most accurate one takes 100x (or more) the computation time, it may not be the best choice indeed...

Especially depending on your planned use: e.g. if you can only start browsing the website after 100 seconds instead of after 1!
