Alternative headline: CompSci boffins find that logs are overly verbose and duplicated.
CompSci boffins claim they can recreate missing lines in log files
CompSci boffins think they've come up with a novel way to recreate missing entries in log files. In a paper titled Bagging Recurrent Event Imputation for Repair of Imperfect Event Log with Missing Categorical Events, Dr Sunghyun Sim and Professor Hyerim Bae (both from Pusan National University in South Korea), and Professor …
COMMENTS
-
Wednesday 15th December 2021 12:13 GMT Wally Dug
System Error
"restoring missing event values... which can overcome human or system error."
What exactly is a "system error"? Is a cronjob that wasn't run due to <<whatever>> a system error and, if so, could an entry be inserted into the log file for that "missing" run when in actual fact we need the entry to be not there? The last sentence is completely true: "...imputed logs clearly have potential to make life interesting for digital forensics practitioners." Perhaps for admins too.
Maybe I'm missing the point, but surely if a log file is so critical, there will already be security in place?
-
Wednesday 15th December 2021 12:16 GMT Loyal Commenter
Using "AI" to make guesses
So, what they are doing is creating log entries that the software deems *should have* been there, with time-stamps that it reckons are about right.
Thus rendering one of the main purposes of a log prone to error. If I want to read through a log file (and want is probably a bit of a strong word there), I will almost certainly want to know the exact sequence of events, which are likely to have occurred in close succession.
Given the nature of modern multi-threaded and asynchronous programming, the timing and sequence of events can be very important in tracking down and diagnosing issues. If some "AI" has come along and inserted entries into that log file with "best guess" timing / sequence / content, it is going to be actively counter-productive.
I'd be focusing instead on why some of your log entries aren't getting recorded accurately in the first place, because this sounds like a "clever" solution for an imagined problem. I can't say I've ever experienced this sort of thing happening with any of the logging frameworks I've ever used.
-
-
Wednesday 15th December 2021 13:13 GMT ThatOne
Re: Using "AI" to make guesses
> there is absolutely no benefit in imagining something that might perhaps fill the slot; it tells you exactly nothing
Yes, but it's neater... And since quite often the letter of the rule is way more important than the spirit, you need to have clean, neat, complete logs, no matter what's in them.
[15/Dec/2021:08:02:21] Lorem ipsum dolor sit amet, consectetur adipiscing elit
[15/Dec/2021:08:02:58] sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
(and so on)
-
Wednesday 15th December 2021 15:09 GMT amanfromMars 1
Re: Using "AI" to make guesses
If there's nothing in the logs after an event, there is absolutely no benefit in imagining something that might perhaps fill the slot; it tells you exactly nothing. .... Neil Barnes
Surely one cannot be serious and actually believe any or even all of that, Neil?
Such would virtually tell anyone with an earnest honest interest practically everything needed to be known and not done with regard to the event.
-
-
Wednesday 15th December 2021 16:15 GMT Loyal Commenter
Re: Using "AI" to make guesses
I'm going to make a wild guess that they are actually academics who don't have the multi-decade real-world experience of a lot of the commenters here, so exactly the opposite of that is true.
It's probably someone's thesis topic. "Pick an interesting problem to work on. No, it doesn't matter if that problem doesn't really exist, Knuth solved all the real ones decades ago."
-
Wednesday 15th December 2021 16:59 GMT Wellyboot
Re: Using "AI" to make guesses
Two things spring to mind.
Having a log just stop gives an easily spotted point of FUBAR to work back from, not many of us will appreciate trying to find the last real entry just to start the process.
If the overall system is running well enough to be able to produce made up log entries it can B****y well punt a live message to someone saying 'X' has just packed in logging the events we were expecting.
-
-
Wednesday 15th December 2021 18:58 GMT jake
Re: Using "AI" to make guesses
I'm going to make a wild guess that the people who worked on this probably know a little bit more about how to justify receiving grant money than people (like me) who got our degrees and then got out of Uni and entered the real world. Otherwise they wouldn't bother doing it.
-
-
Wednesday 15th December 2021 16:27 GMT Loyal Commenter
Re: Using "AI" to make guesses
As it happens, one of my many and varied dumpster fires that needed putting out today involved doing exactly this: unpicking log files from two different sources (one of which logs things happening in parallel in multiple threads, line by line) to work out the sequence of events, and to determine at exactly which point an API returned an internal error, to try and infer why.
If any of those log entries had been "filled in", either with "expected content" or a timestamp from somewhere else (hint: not all server clocks are synchronised to the fraction of a millisecond, but the times in these logs are accurate to that degree, and entries are always written in the order that they are logged, even if they have the same timestamp), then this could very well have led me to the wrong conclusion, which, thankfully, was of the "it's someone else's problem" variety.
On the other hand, if a log is so regular that you can easily infer the order that entries should occur, even if such entries are missing, then you're not really writing a useful log. You're either writing an audit, or wasting disk space. You probably don't want to be writing software that automatically falsifies audits for you.
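The unpicking described above — interleaving two independent logs by timestamp to reconstruct a sequence of events — can be sketched roughly like this. A minimal illustration only: the "timestamp | message" line format is an assumption, and real logs will differ.

```python
from datetime import datetime

def parse_ts(line):
    """Split a 'timestamp | message' line (the format is an assumption)."""
    stamp, _, msg = line.partition(" | ")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), msg

def merge_logs(lines_a, lines_b):
    """Interleave two logs by timestamp. sorted() is stable, so entries
    with identical stamps keep their order within each source file --
    which matters when, as above, stamps collide."""
    tagged = [("A", parse_ts(line)[0], line) for line in lines_a] + \
             [("B", parse_ts(line)[0], line) for line in lines_b]
    return [f"[{src}] {line}" for src, _ts, line in
            sorted(tagged, key=lambda t: t[1])]
```

The point being: the merge only preserves the truth if the timestamps are real. Any "best guess" stamp inserted into one source silently reorders the merged view.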
-
Wednesday 15th December 2021 16:40 GMT Zippy´s Sausage Factory
Re: Using "AI" to make guesses
What's the betting this gets used in court? Someone says "someone deleted entries from our log file... must be hackers" and uses the AI to "rebuild" the log files. Those get submitted in court, and of course, because it's AI and it's a computer, it "never makes a mistake".
I mean OK this is a bit of slippery slopeism and it probably says more about my cynical worldview than anything, but as usual we have to be careful with AI and remember that it isn't really intelligent, it just pretends to be.
Anyway, I'm off to go and hide in my cupboard. Might do a bit of moaning and wailing later, if I feel in the mood. (Gnashing of teeth is a luxury I reserve for the weekends).
-
Wednesday 22nd December 2021 10:54 GMT oiseau
Re: Using "AI" to make guesses
... inserted entries into that log file with "best guess" timing / sequence / content, it is going to be actively counter-productive.
You are too kind.
The net result will be rubbish.
... sounds like a "clever" solution for an imagined problem.
Quite so.
But there's been a lot of that going around in the past few years.
Think Poettering's systemd for a good example of that.
O.
-
-
Wednesday 15th December 2021 12:31 GMT andy 103
correlates data *from other sources*??
"Recreating" lines isn't really accurate then. All they're doing is getting data that has already been recorded from other sources and then trying to work out where it fits into a file with "missing" data.
Why is time, energy and effort being spent on these bullshit activities?
The three authors couldn't find a tool to recreate missing events. So they built one that correlates data from other relevant sources.
If the data is already there, then the actual real problem is that some people don't know where it is.
I can't envisage this being used in any serious or critical application. Imagine if flight data recorders worked on this premise. We'll just try and guess the sequence of events so we can put everything into 1 convenient file, rather than having the prerequisite knowledge to determine them accurately... Fuck off.
-
-
Wednesday 15th December 2021 17:44 GMT Loyal Commenter
Re: correlates data *from other sources*??
Just because you *can* do something, does not mean that you *should*.
I may be missing the real-world use case for this, but it sure sounds like it's an "academic problem".
Bemoaning that others "don't understand" is more than a mite condescending. I'm pretty sure most people have understood what they are doing, we just don't know why you would want to do that.
To most analytical minds, the absence of an entry in a log file tells us more than having a guessed-at entry filled in, in its place. Especially so, if that entry is being constructed from other data, because it's pretty obvious that if it is missing from the log file, then we should go looking to see where it actually is, and in doing so, hopefully gain an understanding of the sequence of events that has led to that situation.
Log analysis is an art more than a science. More often than not it will consist of a process of filling in gaps ourselves to determine the sequence of events that could have happened. In doing so, this may raise other questions, which may well lead us towards a real underlying problem that needs solving. If you hand this process over to an "AI" to make a best-guess at everything, then this problem-solving process never happens.
-
-
-
-
Wednesday 15th December 2021 15:51 GMT Antonius_Prime
Re: BOFH
It's OK.
Repeat in front of a mirror until you can say it with the most shaken, heartbroken expression you can manage (and not giggle):
"There's been a terrible accident..."
Apropos of nothing, anyone seen my bag of quicklime and my roll of carpet? I put them down when I went to get my printout of poorly surveilled woodland sites and building sites with deep concrete pours occurring soon...
-
-
-
Wednesday 15th December 2021 14:50 GMT Peter Galbavy
Use an AI guessing to train another AI and lie about "evidence". Nice. Just what some politicians need.
Event logs are very often used as evidence - not necessarily the legal kind - to establish the sequence and timing of events, who/what was involved and responsible. Tampering with those event logs is just like any other record tampering, even if it's tied up in a nice red bow and a gift tag that says "With Love from your favourite AI".
Then the side note about logs being used to train AIs is in itself suspicious. If you use fake records to train an AI, then all you are doing is reinforcing whatever bias you decided was important to you.
Is there a rotting fish icon?
-
Friday 17th December 2021 12:23 GMT amanfromMars 1
Misinformation is not a Great 0Sum Game
Event logs are very often used as evidence - not necessarily the legal kind - to establish the sequence and timing of events, who/what was involved and responsible. Tampering with those event logs is just like any other record tampering, even if it's tied up in a nice red bow and a gift tag that says "With Love from your favourite AI”. ..... Peter Galbavy
Tampered event logs in the West are invariably default tagged and gifted as if “From Russia with Love”
Can you imagine the insight/foresight such a gross mischaracterisation delivers to those in the East? It tells them practically all that they need to know about the weaknesses being attacked and defeated in the West.
-
-
-
Wednesday 15th December 2021 15:30 GMT Anonymous Coward
Iteration 1: There was no event.
Iteration 2: There was an event, but we joined the event database and the rule database, so the event must have obeyed the rules.
Iteration 3: There was an event that broke the rules, but we weren't there.
Iteration 4: I join your denial to the 'they all lie, all the time' axiom and hey presto: truth!
-
-
-
Wednesday 15th December 2021 18:10 GMT diodesign
Re: Example?
I've added an infographic and a link to a summary of the study by one of the universities. It basically, to me, works by figuring out what data from various sources is needed to create a log's entries, and then automating the process of generating missing entries from that data.
C.
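As a very rough sketch of that idea (all the step names and the `side_channel` source are hypothetical, and this is far simpler than whatever the paper actually does): where an expected step is absent from the log but a correlated data source knows when it happened, synthesise an entry and flag it so it can never be mistaken for a recorded one.

```python
def impute_missing(log, side_channel,
                   expected=("received", "scanned", "loaded", "departed")):
    """Naive imputation sketch. `log` is a list of {'step', 'ts'} dicts;
    `side_channel` is a hypothetical dict of step -> timestamp taken from
    another system. Synthesised entries are explicitly flagged."""
    seen = {entry["step"] for entry in log}
    out = [dict(entry, imputed=False) for entry in log]
    for step in expected:
        if step not in seen and step in side_channel:
            out.append({"step": step, "ts": side_channel[step], "imputed": True})
    return sorted(out, key=lambda entry: entry["ts"])
```

The `imputed` flag is the part most commenters here would insist on: without it, the repaired log is indistinguishable from a real one.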
-
Wednesday 15th December 2021 23:15 GMT Doctor Syntax
Re: Example?
It's still no clearer exactly what they're doing because it's just a pile of jargon. It's the "recurrent event imputation" that concerns me. The nearest I can make of it is "There's usually an event of type X here but there isn't in this case so let's add one." Possibly it means something different and got lost in translation from the Korean.
-
-
-
Wednesday 15th December 2021 18:23 GMT Pascal Monett
"what the log entry should have been"
That is nothing more than rewriting history.
The absence of information can be just as significant as its presence. If something exists in one source and is absent from another, that means that there is a process that failed and could not write a log entry. For debugging purposes, that is literally more important than the pseudo re-creation of log data.
-
-
Friday 17th December 2021 07:13 GMT amanfromMars 1
What I can tell y'all about these times ...
Sometimes what isn't there is the important thing ... this, if implemented, will be nuke on sight on any system I admin ... just like any other form of malware. ..... jake
Things have moved on by quantum leaps and bounds, jake, into new fields of terror and/or excitement with the realisation of an enigmatic achievement which can neither be effectively attacked and gratuitously assaulted nor ever physically damaged and virtually defeated.
Always sometimes what isn't revealed there is the important thing ... for that, whenever correctly configured and implemented, cannot fail to nuke on sight any systems administration like no other form of unknown malware or known software empowering hardware and vapourware/ponziware/zombieware
To some who would be many is that a Doomsday 0Bug to Fear and Server, to A.N.Others and a Few a Heavenly Delight to Diabolically Savour and Favour ......... and a Present Code Red Conditioning Event to Deny is on ACTive Mission PACT Manouevres ‽ .
And for those who may need to know* what they are trying to deny is a current situation .... say hello and welcome to Advanced Cyber Threats and Persistent ACTive CyberIntelAIgent Treats and all possible variations and reverse engineerings of those themes and memes.* .... Royal Chartered, £2.6bn granted, UKGBNI Cyber Security Council ??? :-) It just wouldn't be fair on them, would it, for them to be able to plead complete ignorance of such an affair, hence their being specifically singled out and highlighted in this post, although one doesn't have to be a genius or an Einstein to realise there be at least a few others worthy of mention who might wish to more fully avail themselves of such novel info and disruptive intel in order to take overwhelming advantage of its many benefits and massively utilise its myriad pitfalls/exploitable 0day vulnerabilities.
????? Surely you do not expect the Future to be anything like the Past and bear any responsibility for continuing woes in the Present.????? That would be to suggest madness rules, progress has not been made and evolution is halted..... which is clearly preposterous and evidently ridiculous.
-
-
Wednesday 15th December 2021 20:56 GMT ibmalone
That there real world...
Okay, so computer scientists have an obsession with imputation that borders on the unhealthy, and we all know you can't recreate information that's not present in your starting data.
But what about this? Compare your imputed log with the original. What's missing that's expected? What's different? That's the automated version of the exact ad-hoc process many people describe themselves using. If I ever have to go log-trawling I like to have a copy from a good state to compare to.
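That comparison step could be automated along these lines, assuming "[timestamp] message" lines (as in the examples upthread) and comparing on message text only, since timestamps differ between runs:

```python
def messages(lines):
    # assumes "[timestamp] message" lines; an assumption about the format
    return [line.split("] ", 1)[-1] for line in lines]

def log_diff(reference, suspect):
    """Compare a log against a copy from a known-good run, ignoring
    timestamps: what is expected but missing, and what is unexpected?"""
    ref, sus = messages(reference), messages(suspect)
    missing = [m for m in ref if m not in sus]      # expected, absent
    unexpected = [m for m in sus if m not in ref]   # present, unexplained
    return missing, unexpected
```

The gaps and the surprises are exactly what you go looking for by eye; the imputed log would just be a machine-generated "reference" to diff against.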
It's not actually the authors' aim, though, as reading the paper reveals. They basically just want to repair incomplete logs so they can carry out process mining (read: do stats on them), not forensics (the word doesn't appear once in the paper). Imputation is a fairly common technique in stats, although you need to conduct a sensitivity analysis to check it hasn't altered the results; it's generally used to patch up a method that can't properly handle missing data.
Anyway, I always get suspicious when people start talking about "the real world" as if it's somewhere they inhabit that others don't. It usually just means you think your own experience is universal.
An example: the logs they are talking about are not computer system logs as most of the replies above appear to assume. They are logs for things like container handling processes. Which answers another question, why would anyone spend money on this? "This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1A2C1102294) and research projects of ’Development of IoT Infrastucture Technology for Smart Port’ funded by the Ministry of Oceans and Fisheries, Korea."
Yeah, those ivory tower academics with no idea about the real world.
-
Thursday 16th December 2021 00:11 GMT Doctor Syntax
Re: That there real world...
OK, I get that. What they seem to be doing is cleaning up real world logs to present their system's best guess of what the logs should look like to train another system to spot discrepancies in real world logs.
What could possibly go wrong?
Scenario 1: Say they get 90% of the input doing one thing, 5% doing something else, and five lots of 1% each doing individual other things. They decide that the 90% is what's right, clean up the remaining 10% to look similar, and train the second system on those. The second system gets more examples of the 5% and starts flagging them as errors. In fact that was a legitimate outcome, but because the imputation system fudged the data, the second system was mistrained. Note that in order to do its thing the imputation system must have noted these variations, and could usefully have flagged them for review by an actual real live expert.
Scenario 2: Same sort of results but all the discrepancies are simply failures in the logging system. The second system starts throwing errors looking at real world data because the logging system is making similar errors. The logging system is not fit for purpose and no amount of cleaning of the training data is going to fix it.
The application area seems to be logistics. Any time I've been on the (non-)receiving end of a logistics error, it's been fairly clear to me that something hasn't been scanned in or out when expected. What's missing is Real Intelligence when designing and implementing the system, to raise an alarm in real time when the expected has failed to happen. No amount of Artificial Intelligence applied after the event is going to fix the problems in anything like an effective manner.
-
Saturday 18th December 2021 00:39 GMT ibmalone
Re: That there real world...
Ah, I've confused the situation here. I wrote that opening suggestion (the comparison) as an example of another way you could look at the situation (given everyone had gone straight to assuming it was for the stupidest possible application), before reading very far in the paper. I considered taking it out, and possibly should have. Though I think the point still stands: even if this were a process log, you could still probably find a use for such a tool.
In terms of the actual application, it is logistics, but they're not using it to try to fix parcel tracking; they're using it (to my understanding) to plug into models that measure how efficiently things are moving around ports and the like, and said models really don't like it if something disappears, even if it's obvious what happened next.
-
-
Thursday 16th December 2021 11:59 GMT Loyal Commenter
Re: That there real world...
I have worked in both academia, and in the world of business, and can categorically say that the source of funding for an academic project bears very little relationship to the people doing the research.
Typically, you would have an academic of some sorts (usually a lecturer or professor) scraping around any and all funding bodies of which they can get their foot into the door, in order to beg for whatever money they can get.
Once they get a source of cash, they then go looking for one or more researchers to do some work on it, typically PhD students, or post-doctoral researchers, but sometimes final year undergraduates if the money is thin. When the work is done, they publish a paper with their name on it, and, if they are still on speaking terms with the people who did the actual work, their names as well.
At least, that's how it works in the UK. I would imagine that universities and funding bodies in South Korea work in much the same way.
Now, it may well be that they are talking about shipping logs, and not log files spat out of a computer. Fundamentally, though, these are the same thing; a record of something having happened. If the record doesn't show it happened, it seems foolhardy to write a bit of software that fills in the details to make it look as if it had. If the whole point of the exercise is to "improve" the quality of low quality data for whatever purpose, then I'll refer you to the old adage of, "Garbage In; Garbage Out". If you're going to try and use that "corrected" data for any sort of statistics, then I'm sure I can find a room full of angry statisticians for you who'll chew your ear off about error bars and confidence intervals.
-
Saturday 18th December 2021 00:32 GMT ibmalone
Re: That there real world...
I'm familiar with the general process, although the description ignores that funders have increasingly specific ideas about what they want.
Interestingly, the paper quotes the exact same phrase you do. Error bars and confidence intervals exist with any data, purely from sampling if nothing else, and, yes, statisticians will happily talk about imputation with caveats. One pretty basic trick you can do, for example, if what you are interested in is a metric calculated from said logs, is take a set, apply random dropouts, and see how that affects your metrics on the recovered copy. The authors in this paper do exactly that. Because of course "garbage in, garbage out" assumes that you're starting with *garbage* rather than merely a slightly degraded signal. You can still listen to music with a few scratches on the record, yuo cna mkae sesne of thys snetnce, you can extract meaningful data from flow logs with some missing entries. Particularly if your filling-in can access other data sources for the recovery. The only data I can't extract is how that is controversial.
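That dropout trick is simple enough to sketch. A toy version, where `recover` stands in for whatever imputation method is under test (a hypothetical callable) and `metric` is whatever you compute from the log:

```python
import random

def dropout_sensitivity(log, metric, recover, drop_rate=0.1, trials=50, seed=42):
    """Delete a random fraction of entries, run the recovery step under
    test, and measure how far the metric drifts from its value on the
    intact log. Returns (baseline metric, worst observed drift)."""
    rng = random.Random(seed)
    baseline = metric(log)
    worst = 0.0
    for _ in range(trials):
        degraded = [e for e in log if rng.random() >= drop_rate]
        worst = max(worst, abs(metric(recover(degraded)) - baseline))
    return baseline, worst
```

If the worst-case drift is small relative to the effect you care about, the imputation hasn't altered your conclusions; if it isn't, you've just measured exactly how much you should distrust it.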
-
-
-
Sunday 19th December 2021 14:20 GMT hayzoos
I sure hope they follow a process of creating a new synthetic/virtual log. Realistically, this is nothing new. I routinely assembled consolidated logs from individual computers in a system, then further processed them to create a synthetic log in a standardized format, with expected entries calculated and inserted with a flag marking them as calculated. Individual raw logs were retained, any intermediary processed logs were retained; basically all source, in-between, and final results were available for review.
systemd was not widely distributed when I was practicing this, so it was not part of my equations. Windows logs are hybrid text/tokens in native format and require the appropriate DLLs for messages represented by the tokens. With Windows there were so many variables in tokenization that immediate conversion to pure text was required on the machine that produced the logs. Various flavors of *nix had their own quirks as well. Text is the lowest common denominator. All timestamps were converted to "YYYY-MM-DD HH:MM:SS.####" format at zero TZ offset, adjusted out of summertime/daylight saving time if required.
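That timestamp normalisation step might look something like this (a sketch, not the actual tooling described; the input format string and timezone name are illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def normalise(stamp, fmt, source_tz):
    """Convert a local-time log stamp into the zero-offset canonical form
    'YYYY-MM-DD HH:MM:SS.####' described above; zoneinfo applies the
    summertime/DST shift so we don't have to."""
    local = datetime.strptime(stamp, fmt).replace(tzinfo=ZoneInfo(source_tz))
    utc = local.astimezone(timezone.utc)
    # four fractional digits, per the stated format
    return utc.strftime("%Y-%m-%d %H:%M:%S.") + f"{utc.microsecond // 100:04d}"
```

Getting every source onto one clock like this is the unglamorous prerequisite for any consolidated log being worth reading.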
As I was reading the article and comments, systemd popped into my head. Not so much from the standpoint of its log format. Not from the possibility that this feature sounds like a perfect fit for systemd to adopt. But from the fact that systemd is so complex that it would likely be the cause of dropped log entries in its own and other system logs.