# Fujitsu sidesteps data scientists with a move toward tuned machine learning

Simple questions can be difficult to answer when the predictive analysis needles being looked for are buried in a 50-million-record haystack. However, so-called Tuned Machine Learning techniques can be used to automate data scientists' work, and get answers in a couple of hours that used to take a week or more. The questions …

1. #### But where is the commonsense reasoning at?

Required reading for the newfangled data scientist, because the old problems are still problems: "Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence"

Who is taller, Prince William or his baby son Prince George? Can you make a salad out of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the pin? These types of questions may seem silly, but many intelligent tasks, such as understanding texts, computer vision, planning, and scientific reasoning require the same kinds of real-world knowledge and reasoning abilities.

1. #### Q: Can you make a salad out of a polyester shirt?

A: McDonald's

2. #### Re: But where is the commonsense reasoning at?

The oranges you're asking about can't be found in this particular collection of apples.

"Commonsense reasoning" is just a matter of incorporating a sufficiently rich world model in an ML system. The rub, of course, is that "incorporating" is a complex problem, "sufficiently rich" is often an infeasible one, and "just a matter of" approaches "the impossibility of" as the problem domain grows.

But generally speaking, most ML problems fall into one of two classes: one where we have some ideas about how to build that world model in theory but don't have the resources to do it in practice; and another where it's not appropriate for the kinds of problems we're interested in solving. And the problems Fujitsu are talking about here fall into the latter class.

2. #### Why cheesoid exist

Why does anything exist - as cosmologists are always asking

PETRIL

1. #### Re: Why cheesoid exist

Damn... you got there first!

3. #### Accuracy != Robustness

Having worked for 15 years or so in analytics (which has somehow now become data science) it's been great to see the recent expansion of available tools, particularly open source projects such as Spark. While I welcome all of them, this particular initiative seems to be missing a number of very important points that generally constitute the time-consuming part of an analyst's work. First, a single accuracy measure is often not terribly useful - especially for rare events. [Simple example: if something only happens 1% of the time then a model that simply says that things will never happen will have 99% accuracy. Clearly that's not a very useful model.] Generally it's more useful to look at precision vs. recall, which tend to pull in opposite directions as you move the cut-off rules applied in the model, so you are looking for the best trade-off between the two. There are measures that combine the two into a single value that can be compared across different models, but the choice of measure is normally subjective and based on business needs.
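To put numbers on the 1% example, here is a minimal Python sketch. The data set (10 events in 1,000 records), the second model's predictions, and the helper names `precision_recall` and `f_beta` are all made up for illustration; F-beta is just one of the possible combined measures, not necessarily the one anybody here would pick.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels, where 1 marks the rare event."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Combine precision and recall; beta > 1 weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: the event happens in 10 of 1,000 records (1%).
actual = [1] * 10 + [0] * 990

# Model A never predicts the event.
never = [0] * 1000

# Model B (made up) flags 20 records and catches 8 of the 10 real events.
flagging = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

for name, predicted in [("never", never), ("flagging", flagging)]:
    p, r = precision_recall(actual, predicted)
    print(name, accuracy(actual, predicted), p, r, f_beta(p, r))
```

The "never" model wins on accuracy yet has zero recall, while the flagging model gives up a little accuracy for something you can actually act on. Changing `beta` is where the subjective business judgement comes in.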

Secondly (and especially important for large-scale machine learning problems with very large numbers of features) you also need to understand the amount of over-fitting in your model - i.e. how likely is it that the model performance seen in the train & test data sets will be seen in new data. This can be handled through cross-validation techniques (which I don't see mentioned in the article?), but again it often comes down to a subjective evaluation of the trade-off between "accuracy" and "robustness". It is these subjective evaluations that generally take the most time, rather than waiting for computers to crunch through data.
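A rough Python sketch of the cross-validation idea follows. Everything in it is an assumption chosen for illustration: the synthetic one-feature data with 20% label noise, the 1-nearest-neighbour model (picked because it memorises its training data), the five folds, and the helper names.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def predict_1nn(train, x):
    """Return the label of the training point nearest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points whose label the 1-NN model gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


rng = random.Random(0)

# Synthetic data: the label is 1 when x > 0.5, then flipped 20% of the time.
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

# Scoring on the training data itself: every point is its own nearest neighbour.
train_acc = accuracy(data, data)

# 5-fold cross-validation: each fold is held out once and scored by a model
# built from the other four folds.
folds = k_fold_indices(len(data), 5, rng)
cv_scores = []
for held_out in folds:
    held = set(held_out)
    train = [data[i] for i in range(len(data)) if i not in held]
    test = [data[i] for i in held_out]
    cv_scores.append(accuracy(train, test))
cv_acc = sum(cv_scores) / len(cv_scores)

print("accuracy on training data:", train_acc)
print("cross-validated accuracy: ", cv_acc)
```

The gap between the two numbers is the over-fitting. How big a gap is tolerable is the subjective part that no amount of extra hardware decides for you.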

In short - I would take these claims with a massive pinch of salt in terms of their impact on real-life analytics (oh, okay ... data science) applications. If you have analysts spending a week to create a single model (even if they are testing different algorithms) then you have a problem with your analysts, not your machine learning tools. And, of course, creating a good predictive model doesn't actually do anything for your business - you generally need to fit it into a robust decision management framework, which is typically much more complicated.

/rant

Apologies for going on about this, but as happy as I am to read more about machine learning in El Reg, I'd love for you to apply your usual level of skepticism. Happy to provide awkward questions for you to ask if you'd like!

1. #### Re: Accuracy != Robustness

Furthermore: "Where the analysis takes more than 12 hours it’s usually applied to a sub-set of the data and the result is less accurate."

That is patent nonsense. Firstly, if you are doing analysis on all the data then you have all the possible results. But let's assume you mean all the data available now. Even so, there is a very well understood technique called sampling that is at the heart of statistics. The accuracy* improvement you get from going from a very large dataset to a very, very large dataset is usually very small.
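As a back-of-envelope illustration of the sampling point, the standard error of an estimated proportion under simple random sampling is sqrt(p(1-p)/n). The Python sketch below assumes a 1% event rate and compares a hypothetical one-million-record sample with the article's full 50 million records; both figures and the function name `standard_error` are chosen purely for illustration.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)


p = 0.01               # assumed event rate
sample = 1_000_000     # hypothetical sample size
full = 50_000_000      # the article's 50-million-record haystack

se_sample = standard_error(p, sample)
se_full = standard_error(p, full)

print("standard error on the sample:  ", se_sample)
print("standard error on the full set:", se_full)
print("improvement factor:            ", se_sample / se_full)
```

Fifty times the data narrows the error by a factor of sqrt(50), roughly seven, and the error on the one-million sample is already about one hundredth of a percentage point.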

Finally there are a whole range of reasons why an analysis can take more than 12 hours. Run it on your ZX81 cluster and you would be delighted to get any answer within 12 hours.

*see above comment for more on accuracy.

2. #### Re: Accuracy != Robustness

Agree completely!

Anything above 95% accuracy screams of over-fitting, or else the data are so noddy that ML is irrelevant.

I also am happy to provide rock salt to go with these articles.

I'd give more upvotes if I could.

4. #### Next Stop on the Autonomous Robot Line: Singularity Blvd.

I'm thinkin' machine "learning" is just another way of saying autonomous self programming here in the nascent Church of the Cosmic Algorithm. Buckle up kidz...
