CompSci boffins claim they can recreate missing lines in log files

CompSci boffins think they've come up with a novel way to recreate missing entries in log files. In a paper titled Bagging Recurrent Event Imputation for Repair of Imperfect Event Log with Missing Categorical Events, Dr Sunghyun Sim and Professor Hyerim Bae (both from Pusan National University in South Korea), and Professor …

  1. Disgusted Of Tunbridge Wells Silver badge
    Holmes

    Alternative headline: CompSci boffins find that logs are overly verbose and duplicated.

    1. Wellyboot Silver badge

      and can be edited.

      1. W.S.Gosset Silver badge

        and can NOT be edited, Your Honour.

        FTFY^H^H^H^H

  2. Wally Dug
    WTF?

    System Error

    "restoring missing event values... which can overcome human or system error."

    What exactly is a "system error"? Is a cronjob that wasn't run due to <<whatever>> a system error and, if so, could an entry be inserted into the log file for that "missing" run when in actual fact we need the entry to be not there? The last sentence is completely true: "...imputed logs clearly have potential to make life interesting for digital forensics practitioners." Perhaps for admins too.

    Maybe I'm missing the point, but surely if a log file is so critical then there will already be security in place?

    1. ThatOne Silver badge
      Facepalm

      Re: System Error

      Hear that distant rumbling? That's hackers shivering with anticipated pleasure! Just delete any incriminating log files about your activities, and the victim's storyteller will create new innocent ones to replace them...

  3. Loyal Commenter Silver badge

    Using "AI" to make guesses

    So, what they are doing is creating log entries that the software deems *should have* been there, with time-stamps that it reckons are about right.

    Thus rendering one of the main purposes of a log prone to error. If I want to read through a log file (and want is probably a bit of a strong word there), I will almost certainly want to know the exact sequence of events, which are likely to have occurred in close succession.

    Given the nature of modern multi-threaded and asynchronous programming, the timing and sequence of events can be very important in tracking down and diagnosing issues. If some "AI" has come along and inserted entries into that log file with "best guess" timing / sequence / content, it is going to be actively counter-productive.

    I'd be focusing instead on why some of your log entries aren't getting recorded accurately in the first place, because this sounds like a "clever" solution for an imagined problem. I can't say I've ever experienced this sort of thing happening with any of the logging frameworks I've ever used.

    1. Neil Barnes Silver badge

      Re: Using "AI" to make guesses

      Thank you - you saved me saying exactly that.

      If there's nothing in the logs after an event, there is absolutely no benefit in imagining something that might perhaps fill the slot; it tells you exactly nothing.

      1. Steve K

        Re: Using "AI" to make guesses

        It's precisely wrong...

        1. jake Silver badge

          Re: Using "AI" to make guesses

          It's actually not even wrong.

          It's a fabrication, a guess, a story, and has no place in (especially!) forensics.

      2. ThatOne Silver badge
        Devil

        Re: Using "AI" to make guesses

        > there is absolutely no benefit in imagining something that might perhaps fill the slot; it tells you exactly nothing

        Yes, but it's neater... And since quite often the letter of the rule is way more important than the spirit, you need to have clean, neat, complete logs, no matter what's in them.

        [15/Dec/2021:08:02:21] Lorem ipsum dolor sit amet, consectetur adipiscing elit

        [15/Dec/2021:08:02:58] sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

        (and so on)

      3. amanfromMars 1 Silver badge

        Re: Using "AI" to make guesses

        If there's nothing in the logs after an event, there is absolutely no benefit in imagining something that might perhaps fill the slot; it tells you exactly nothing. .... Neil Barnes

        Surely one cannot be serious and actually believe any or even all of that, Neil?

        Such would virtually tell anyone with an earnest honest interest practically everything needed to be known and not done with regard to the event.

      4. JDX Gold badge

        Re: Using "AI" to make guesses

        I'm going to make a wild 'guess' that the people who worked on this probably know a little bit more about it than random IT people (like me) who lack the context. Otherwise they wouldn't bother doing it.

        1. Loyal Commenter Silver badge

          Re: Using "AI" to make guesses

          I'm going to make a wild guess that they are actually academics who don't have the multi-decade real-world experience of a lot of the commenters here, so exactly the opposite of that is true.

          It's probably someone's thesis topic. "Pick an interesting problem to work on. No, it doesn't matter if that problem doesn't really exist, Knuth solved all the real ones decades ago."

          1. Wellyboot Silver badge

            Re: Using "AI" to make guesses

            Two things spring to mind.

            Having a log just stop gives an easily spotted point of FUBAR to work back from; not many of us will appreciate trying to find the last real entry just to start the process.

            If the overall system is running well enough to be able to produce made up log entries it can B****y well punt a live message to someone saying 'X' has just packed in logging the events we were expecting.

        2. jake Silver badge

          Re: Using "AI" to make guesses

          I'm going to make a wild guess that the people who worked on this probably know a little bit more about how to justify receiving grant money than people (like me) who got our degrees, and then got out of Uni and entered the real world. Otherwise they wouldn't bother doing it.

    2. iron Silver badge

      Re: Using "AI" to make guesses

      Agreed. If an AI fills in the lines that should have been there then I won't find the error I'm looking for, making those logs totally worthless.

      1. Brewster's Angle Grinder Silver badge

        Synthesizing a haystack without a needle will not help you find the missing needle

        What you're looking for in a log is the exception to the rule - not the humdrum pattern.

    3. Loyal Commenter Silver badge

      Re: Using "AI" to make guesses

      As it happens, one of my many and varied dumpster fires that needed putting out today involved doing exactly this, unpicking log files from two different sources, one of which logs things happening in parallel in multiple threads, line by line, to work out the sequence of events to determine at exactly which point an API returned an internal error, to try and infer why.

      If any of those log entries had been "filled in", either with "expected content", or a timestamp from somewhere else (hint: not all server clocks are synchronised to the fraction of a millisecond, but the times in these logs are accurate to that degree, and entries are always written in the order that they are logged, even if they have the same time stamp), then this could very well have led me to the wrong conclusion, which, thankfully, was of the "it's someone else's problem" variety.

      On the other hand, if a log is so regular that you can easily infer the order that entries should occur, even if such entries are missing, then you're not really writing a useful log. You're either writing an audit, or wasting disk space. You probably don't want to be writing software that automatically falsifies audits for you.

    4. Zippy´s Sausage Factory
      Devil

      Re: Using "AI" to make guesses

      What's the betting this gets used in court. Someone says "someone deleted from our log file... must be hackers" and uses the AI to "rebuild" the log files. Those get submitted in court, and of course because it's AI and it's a computer it "never makes a mistake".

      I mean OK this is a bit of slippery slopeism and it probably says more about my cynical worldview than anything, but as usual we have to be careful with AI and remember that it isn't really intelligent, it just pretends to be.

      Anyway, I'm off to go and hide in my cupboard. Might do a bit of moaning and wailing later, if I feel in the mood. (Gnashing of teeth is a luxury I reserve for the weekends).

      1. Pascal Monett Silver badge

        AI doesn't pretend anything

        It's only marketing and overzealous presenters who prance about using the term and, like Tesla's "autopilot", pretend that it is something other than what it is : a statistical analysis machine.

      2. oiseau
        Facepalm

        Re: Using "AI" to make guesses

        ... remember that it isn't really intelligent, it just pretends to be.

        Hmmm ...

        Like some boffins researching solutions for inexistent problems?

        O.

    5. oiseau
      Facepalm

      Re: Using "AI" to make guesses

      ... inserted entries into that log file with "best guess" timing / sequence / content, it is going to be actively counter-productive.

      You are too kind.

      The net result will be rubbish.

      ... sounds like a "clever" solution for an imagined problem.

      Quite so.

      But there's been a lot of that going around in the past few years.

      Think Poettering's systemd for a good example of that.

      O.

  4. andy 103
    Stop

    correlates data *from other sources*??

    "Recreating" lines isn't really accurate then. All they're doing is getting data that has already been recorded from other sources and then trying to work out where it fits into a file with "missing" data.

    Why is time, energy and effort being spent on these bullshit activities?

    The three authors couldn't find a tool to recreate missing events. So they built one that correlates data from other relevant sources.

    If the data is already there, then the actual real problem is that some people don't know where it is.

    I can't envisage this being used in any serious or critical application. Imagine if flight data recorders worked on this premise. We'll just try and guess the sequence of events so we can put everything into 1 convenient file, rather than having the prerequisite knowledge to determine them accurately... Fuck off.

    1. JDX Gold badge

      Re: correlates data *from other sources*??

      The fact you can't understand it doesn't make it useless. It just means you don't understand it.

      You've missed the point.

      1. Loyal Commenter Silver badge

        Re: correlates data *from other sources*??

        Just because you *can* do something, does not mean that you *should*.

        I may be missing the real-world use case for this, but it sure sounds like it's an "academic problem".

        Bemoaning that others "don't understand" is more than a mite condescending. I'm pretty sure most people have understood what they are doing, we just don't know why you would want to do that.

        To most analytical minds, the absence of an entry in a log file tells us more than having a guessed-at entry filled in, in its place. Especially so, if that entry is being constructed from other data, because it's pretty obvious that if it is missing from the log file, then we should go looking to see where it actually is, and in doing so, hopefully gain an understanding of the sequence of events that has led to that situation.

        Log analysis is an art more than a science. More often than not it will consist of a process of filling in gaps ourselves to determine the sequence of events that could have happened. In doing so, this may raise other questions, which may well lead us towards a real underlying problem that needs solving. If you hand this process over to an "AI" to make a best-guess at everything, then this problem-solving process never happens.

      2. andy 103
        WTF?

        Re: correlates data *from other sources*??

        @JDX interestingly you weren't able to elaborate on what the point of it actually is. Followed by a comment from yourself 2 hours later "Struggling to see quite how this works."

        Oh dear.

    2. oiseau
      Facepalm

      Re: correlates data *from other sources*??

      Why is time, energy and effort being spent on these bullshit activities?

      Well ...

      Maybe there's good money to be made doing that?

      There's a public for anything.

      O.

  5. Pirate Dave Silver badge
    Pirate

    BOFH

    Eh, did they ever think that maybe some log entries are missing for a reason? Sheesh...

    1. Antonius_Prime
      Devil

      Re: BOFH

      It's OK.

      Repeat in front of a mirror until you can say it with the most shaken, heartbroken expression you can manage (and not giggle):

      "There's been a terrible accident..."

      Apropos of nothing, anyone seen my bag of quicklime and my roll of carpet? I put them down when I went to get my printout of poorly surveilled woodland sites and building sites with deep concrete pours occurring soon...

  6. Doctor Syntax Silver badge

    It will be added to systemd in the next release.

  7. Peter Galbavy

    Use an AI guessing to train another AI and lie about "evidence". Nice. Just what some politicians need.

    Event logs are very often used as evidence - not necessarily the legal kind - to establish the sequence and timing of events, who/what was involved and responsible. Tampering with those event logs is just like any other record tampering, even if it's tied up in a nice red bow and a gift tag that says "With Love from your favourite AI".

    Then the side note about logs being used to train AIs is in itself suspicious. If you use fake records to train an AI then all you are doing is reinforcing whatever bias you decided was important to you.

    Is there a rotting fish icon?

    1. amanfromMars 1 Silver badge

      Misinformation is not a Great 0Sum Game

      Event logs are very often used as evidence - not necessarily the legal kind - to establish the sequence and timing of events, who/what was involved and responsible. Tampering with those event logs is just like any other record tampering, even if it's tied up in a nice red bow and a gift tag that says "With Love from your favourite AI". ..... Peter Galbavy

      Tampered event logs in the West are invariably default tagged and gifted as if “From Russia with Love”

      Can you imagine the insight/foresight such a gross mischaracterisation delivers to those in the East? It tells them practically all that they need to know about the weaknesses being attacked and defeated in the West.

    2. Anonymous Coward
      Anonymous Coward

      @Peter

      just love ppl signing their thoughts with their own name, tippin my hat

  8. Scott Broukell

    But, did the events actually take place or not, were they totally imagined or virtual and, more importantly, were they socially distanced events?

    1. Anonymous Coward
      Anonymous Coward

      Iteration 1. There was no event.

      Iteration 2. There was an event, but we joined the event database and the rule database, so the event must have obeyed the rules.

      Iteration 3. There was an event that broke the rules but we weren't there.

      Iteration 4: I join your denial to the 'they all lie, all the time' axiom and hey presto: truth!

      1. Adrian 4

        What you need is to log the logging events so you can investigate why they weren't logged.

        1. jake Silver badge

          Quis custodiet ipsos data-commentariis?

    2. PerlyKing
      Go

      Re: did the events actually take place or not

      You're making it sound like the perfect application of this would be in quantum computing.

      1. Antonius_Prime
        Trollface

        Re: did the events actually take place or not

        Up until the logs get observed. Until then, they're in a state of superposition and we can't know the contents...

        1. Anonymous Coward
          Anonymous Coward

          Re: did the events actually take place or not

          So... if an event occurs, and there's no logger to record it, does the admin make a sound?

          1. Gene Cash Silver badge

            Re: did the events actually take place or not

            Yes, he says "I need another drink"

  9. JDX Gold badge

    Example?

    Struggling to see quite how this works. A visual example would be really helpful.

    1. diodesign (Written by Reg staff) Silver badge

      Re: Example?

      I've added an infographic and a link to a summary of the study by one of the universities. It basically, to me, works by figuring out what data from various sources is needed to create a log's entries, and then automating the process of generating missing entries from that data.

      C.

      1. Doctor Syntax Silver badge

        Re: Example?

        It's still no clearer exactly what they're doing because it's just a pile of jargon. It's the "recurrent event imputation" that concerns me. The nearest I can make of it is "There's usually an event of type X here but there isn't in this case so let's add one." Possibly it means something different and got lost in translation from the Korean.

        1. Polleke
          Holmes

          Re: Example?

          And this is exactly what you don't want because a missing log entry sometimes means a lot more than all the log entries together.

  10. Anonymous Coward
    Anonymous Coward

    Depends on how you're logging...

    ... but round here the *absence* of a log entry is a pretty good indication that a major, utterly disastrous, problem occurred between the last log entry and where we should see the missing log.

  11. Pascal Monett Silver badge
    Thumb Down

    "what the log entry should have been"

    That is nothing more than rewriting history.

    The absence of information can be just as significant as its presence. If something exists in one source and is absent from another, that means that there is a process that failed and could not write a log entry. For debugging purposes, that is literally more important than the pseudo re-creation of log data.

    1. PRR Silver badge

      Re: "what the log entry should have been"

      > That is nothing more than rewriting history.

      Nice work if you can get it.

      And you can get it, if you lie.

  12. jake Silver badge

    I can't tell you how many times ...

    ... that the lack of a log entry pointed out the exact issue I was tracking down.

    Sometimes what isn't there is the important thing ... this, if implemented, will be nuke on sight on any system I admin ... just like any other form of malware.

    1. amanfromMars 1 Silver badge

      What I can tell y'all about these times ...

      Sometimes what isn't there is the important thing ... this, if implemented, will be nuke on sight on any system I admin ... just like any other form of malware. ..... jake

      Things have moved on by quantum leaps and bounds, jake, into new fields of terror and/or excitement with the realisation of an enigmatic achievement which can neither be effectively attacked and gratuitously assaulted nor ever physically damaged and virtually defeated.

      Always sometimes what isn't revealed there is the important thing ... for that, whenever correctly configured and implemented, cannot fail to nuke on sight any systems administration like no other form of unknown malware or known software empowering hardware and vapourware/ponziware/zombieware

      To some who would be many is that a Doomsday 0Bug to Fear and Server, to A.N.Others and a Few a Heavenly Delight to Diabolically Savour and Favour ......... and a Present Code Red Conditioning Event to Deny is on ACTive Mission PACT Manouevres ‽ .

      And for those who may need to know* what they are trying to deny is a current situation ....say hello and welcome to Advanced Cyber Threats and Persistent ACTive CyberIntelAIgent Treats and all possible variations and reverse engineerings of those themes and memes.

      * ....Royal Chartered, £2.6bn granted, UKGBNI Cyber Security Council ??? :-) It just wouldn’t be fair on them, would it, for them to be able to plead complete ignorance of such an affair hence their being specifically singled out and highlighted in this post although one doesn’t have to be a genius or an Einstein to realise there be at least a few others worthy of mention who might wish to more fully avail themselves of such novel info and disruptive intel in order to take overwhelming advantage of its many benefits and massively utilise its myriad pitfalls/exploitable 0day vulnerabilities.

      ????? Surely you do not expect the Future to be anything like the Past and bear any responsibility for continuing woes in the Present.????? That would be to suggest madness rules, progress has not been made and evolution is halted..... which is clearly preposterous and evidently ridiculous.

  13. ibmalone

    That there real world...

    Okay, so computer scientists have an obsession with imputation that borders on the unhealthy, and we all know you can't recreate information that's not present in your starting data.

    But what about this? Compare your imputed log with the original. What's missing that's expected? What's different? That's the automated version of the exact ad-hoc process many people describe themselves using. If I ever have to go log-trawling I like to have a copy from a good state to compare to.
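A minimal sketch of that comparison, assuming each log is a list of `(case_id, activity)` tuples. The layout, the activity names and the `missing_entries` helper are illustrative assumptions, not anything taken from the paper. The idea is to treat the imputed log as the "expected" copy and report only the entries the real log lacks, leaving the original untouched.

```python
from collections import Counter

def missing_entries(original, imputed):
    """Return entries that appear in the imputed log but not in the original.

    Both logs are lists of (case_id, activity) tuples; this layout is an
    assumption for illustration only.
    """
    remaining = Counter(original)
    gaps = []
    for entry in imputed:
        if remaining[entry] > 0:
            remaining[entry] -= 1   # this entry really was logged
        else:
            gaps.append(entry)      # this entry exists only because it was imputed
    return gaps

# Hypothetical container-handling events.
original = [("c1", "gate_in"), ("c1", "gate_out")]
imputed = [("c1", "gate_in"), ("c1", "yard_stack"), ("c1", "gate_out")]
print(missing_entries(original, imputed))
```

Used this way, the imputed copy is only a reference to diff against, which keeps the gap itself visible rather than papering over it.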

    It's not actually the authors' aim though, as reading the paper reveals. They basically just want to repair incomplete logs so they can carry out process mining (read, do stats on), not forensics (the word doesn't appear once in the paper). Imputation is a fairly common technique in stats, although you need to conduct sensitivity analysis to check it hasn't altered the results; it's generally used to patch up a method that can't properly handle missing data.

    Anyway, I always get suspicious when people start talking about "the real world" as if it's somewhere they inhabit that others don't. It usually just means you think your own experience is universal.

    An example: the logs they are talking about are not computer system logs as most of the replies above appear to assume. They are logs for things like container handling processes. Which answers another question, why would anyone spend money on this? "This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1A2C1102294) and research projects of ’Development of IoT Infrastucture Technology for Smart Port’ funded by the Ministry of Oceans and Fisheries, Korea."

    Yeah, those ivory tower academics with no idea about the real world.

    1. Doctor Syntax Silver badge

      Re: That there real world...

      OK, I get that. What they seem to be doing is cleaning up real world logs to present their system's best guess of what the logs should look like to train another system to spot discrepancies in real world logs.

      What could possibly go wrong?

      Scenario 1: Say they get 90% of the input doing one thing, 5% doing something else and 5 off 1% doing individual other things. They decide that the 90% is what's right, clean up the remaining 10% to look similar and train the second system on those. The second system gets more examples of the 5% & starts flagging them as errors. In fact that was a legitimate outcome but because the imputation system fudged the data the second system was mistrained. Note that in order to do its thing the imputation system must have noted these variations and could usefully have flagged these as something to be reviewed by an actual real live expert.

      Scenario 2: Same sort of results but all the discrepancies are simply failures in the logging system. The second system starts throwing errors looking at real world data because the logging system is making similar errors. The logging system is not fit for purpose and no amount of cleaning of the training data is going to fix it.

      The application area seems to be logistics. Any time I've been on the (non-)receiving end of a logistics error it's been fairly clear to me that something hasn't been scanned in or out when expected. What's missing is Real Intelligence when designing and implementing the system to raise an alarm in real time when the expected has failed to happen. No amount of Artificial Intelligence applied after the event is going to fix the problems in anything like an effective manner.
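One way to picture the "alarm when the expected has failed to happen" idea is a simple overdue check. This is only a sketch under stated assumptions: the event tuples, the `expected_next` mapping, the activity names and the one-hour limit are all hypothetical, and a real system would run the check continuously rather than on a fixed list.

```python
from datetime import datetime, timedelta

def overdue_cases(events, now, expected_next, max_wait):
    """Return case ids whose latest event still awaits an overdue follow-up.

    events: list of (case_id, activity, timestamp) in logged order.
    expected_next: maps an activity to the activity that should follow it.
    """
    last = {}
    for case_id, activity, ts in events:
        last[case_id] = (activity, ts)   # keep only the most recent event per case
    return sorted(
        case_id
        for case_id, (activity, ts) in last.items()
        if activity in expected_next and now - ts > max_wait
    )

# Hypothetical parcel scans: B was scanned in but never scanned out.
events = [
    ("A", "scanned_in", datetime(2021, 12, 15, 8, 0)),
    ("B", "scanned_in", datetime(2021, 12, 15, 8, 5)),
    ("A", "scanned_out", datetime(2021, 12, 15, 8, 30)),
]
alarms = overdue_cases(
    events,
    now=datetime(2021, 12, 15, 10, 0),
    expected_next={"scanned_in": "scanned_out"},
    max_wait=timedelta(hours=1),
)
print(alarms)
```

The check flags the missing scan instead of inventing one, so a human can look at it while it still matters.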

      1. ibmalone

        Re: That there real world...

        Ah, I've confused the situation here. I wrote that opening suggestion (the comparison) as an example of another way you could look at the situation (given everyone had gone straight to assuming it was for the stupidest possible application), before reading very far into the paper. I considered taking it out and possibly should have. Though I think the point still stands: even if this was a process log, you could still probably find a use for such a tool.

        In terms of the actual application, it is logistics, but they're not using it to try to fix parcel tracking. They're using it (to my understanding) to plug into models that measure how efficiently things are moving around ports and the like, and said models really don't like it if something disappears, even if it's obvious what happened next.

    2. Loyal Commenter Silver badge

      Re: That there real world...

      I have worked in both academia, and in the world of business, and can categorically say that the source of funding for an academic project bears very little relationship to the people doing the research.

      Typically, you would have an academic of some sort (usually a lecturer or professor) scraping around any and all funding bodies where they can get a foot in the door, in order to beg for whatever money they can get.

      Once they get a source of cash, they then go looking for one or more researchers to do some work on it, typically PhD students, or post-doctoral researchers, but sometimes final year undergraduates if the money is thin. When the work is done, they publish a paper with their name on it, and, if they are still on speaking terms with the people who did the actual work, their names as well.

      At least, that's how it works in the UK. I would imagine that universities and funding bodies in South Korea work in much the same way.

      Now, it may well be that they are talking about shipping logs, and not log files spat out of a computer. Fundamentally, though, these are the same thing; a record of something having happened. If the record doesn't show it happened, it seems foolhardy to write a bit of software that fills in the details to make it look as if it had. If the whole point of the exercise is to "improve" the quality of low quality data for whatever purpose, then I'll refer you to the old adage of, "Garbage In; Garbage Out". If you're going to try and use that "corrected" data for any sort of statistics, then I'm sure I can find a room full of angry statisticians for you who'll chew your ear off about error bars and confidence intervals.

      1. ibmalone

        Re: That there real world...

        I'm familiar with the general process, although the description ignores that funders have increasingly specific ideas about what they want.

        Interestingly the paper quotes the exact same phrase you do. Error bars and confidence intervals exist with any data, purely from sampling if nothing else and, yes, statisticians will happily talk about imputation with caveats. One pretty basic trick you can do for example, if what you are interested in is a metric calculated from said logs, is take a set and apply random dropouts and see how that affects your metrics on the recovered copy. The authors in this paper do exactly that. Because of course "garbage in garbage out" assumes that you're starting with *garbage* rather than merely a slightly degraded signal. You can still listen to music with a few scratches on the record, yuo cna mkae sesne of thys snetnce, you can extract meaningful data from flow logs with some missing entries. Particularly if your filling in can access other data sources for the recovery. The only data I can't extract is how that is controversial.
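The random-dropout check can be sketched roughly as follows. Note the assumptions: the log is reduced to a list of numeric handling times, the metric is their mean, and the "recovery" step is a plain median fill standing in for whatever imputation is under test. None of this is the paper's method; it only shows the shape of the sensitivity test.

```python
import random
import statistics

def dropout_sensitivity(values, drop_rate, trials, seed=0):
    """Estimate how much a metric moves after random dropouts plus imputation.

    Returns (true_metric, mean_absolute_error_over_trials).
    The median fill below is a stand-in imputer, assumed for illustration.
    """
    rng = random.Random(seed)               # fixed seed keeps runs repeatable
    true_metric = statistics.mean(values)
    errors = []
    for _ in range(trials):
        kept = [v for v in values if rng.random() >= drop_rate]
        if not kept:
            continue                        # nothing survived, nothing to impute from
        fill = statistics.median(kept)
        recovered = kept + [fill] * (len(values) - len(kept))
        errors.append(abs(statistics.mean(recovered) - true_metric))
    mean_error = statistics.mean(errors) if errors else None
    return true_metric, mean_error

# Hypothetical handling times in minutes, including one outlier.
times = [10, 12, 11, 13, 50]
print(dropout_sensitivity(times, drop_rate=0.2, trials=1000))
```

If the error stays small relative to the metric, the imputation is probably harmless for that statistic; if it does not, the "repaired" log should not be trusted for it.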

  14. RLWatkins

    No, no they can't.

    This reminds me of the image enhancement we see on TV: pure BS. Once one throws information away, one can't then conjure it back up from thin air.

    That's the long and short of it. Everything else is just handwaving.

  15. Binraider Silver badge

    Can it do anything about the binary blob logs of systemd fame?

  16. hayzoos

    I sure hope they follow a process of creating a new synthetic/virtual log. Realistically, this is nothing new. I routinely assembled consolidated logs from individual computers in a system. Then further processed to create a synthetic log in a standardized format with expected entries calculated and inserted with a flag as a calculated entry. Individual raw logs were retained, any intermediary processed logs were retained, basically all source, in-between, and final results were available for review.

    systemd was not widely distributed when I was practicing this so was not part of my equations. Windows logs are hybrid text/tokens in native format and require appropriate dll's for messages represented by the tokens. With Windows there were so many variables in tokenization that immediate conversion to pure text was required on the machine that produced the logs. Various flavors of *nix had their own quirks as well. Text is the lowest common denominator. All timestamps were converted to "YYYY-MM-DD HH:MM:SS.####" format in zero TZ offset and from summertime/daylight saving time if required.

    As I was reading the article and comments, systemd popped into my head. Not so much from the standpoint of its log format. Not from the possibility that this feature sounds like a perfect fit for systemd to adopt. But from the fact that systemd is so complex that it would likely be the cause of dropped log entries in its own and other system logs.
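The normalisation and flagging described in this comment could look something like the sketch below. It assumes timestamps arrive as offset-aware `datetime` objects (so any summertime offset is already carried in the value), and the `RAW`/`CALC` markers and both function names are made up for illustration; the commenter's actual tooling is not shown in the thread.

```python
from datetime import datetime, timedelta, timezone

def normalise(ts):
    """Render an offset-aware datetime as 'YYYY-MM-DD HH:MM:SS.####' at zero offset."""
    utc = ts.astimezone(timezone.utc)
    # strftime's %f gives six digits, so build the four-digit fraction by hand.
    return utc.strftime("%Y-%m-%d %H:%M:%S.") + f"{utc.microsecond // 100:04d}"

def render(ts, message, calculated=False):
    """Format one consolidated log line, marking calculated (inserted) entries."""
    flag = "CALC" if calculated else "RAW"
    return f"{normalise(ts)} {flag} {message}"

# Hypothetical entry logged at UTC+1.
local = datetime(2021, 12, 15, 9, 2, 21, 123456, tzinfo=timezone(timedelta(hours=1)))
print(render(local, "backup finished", calculated=True))
# prints "2021-12-15 08:02:21.1234 CALC backup finished"
```

Keeping the flag in every line means a reviewer can always strip the calculated entries and get back to what was actually recorded.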
