Using machine learning for the fast verification of contested antivirus decisions

To offer optimal protection, Avast uses an intricate network of detection methods. Whenever a file enters or is run on your computer, a series of checks is typically completed in a fraction of a second. Most malicious files are caught by fast detectors that leave a minimal footprint on your machine. Occasionally, however, these detectors cannot accurately evaluate a file (especially if it is brand-new and unlike anything seen before). In these situations, the cloud backend saves the day: it can run in-depth, comprehensive checks that draw on all of Avast’s expertise. Such thorough testing would be too much for a standalone computer running antivirus software, as it requires too much processing power and data. The interplay between the client-side antivirus engine and its cloud backend ensures that every file is classified as accurately as possible.
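To make the division of labor concrete, here is a minimal Python sketch of such tiered detection. It is purely illustrative: the `Verdict` enum, the local hash store, and the `cloud.deep_analyze` interface are hypothetical stand-ins, not Avast’s actual engine.

```python
import hashlib
from enum import Enum

class Verdict(Enum):
    CLEAN = "clean"
    MALWARE = "malware"
    UNKNOWN = "unknown"

# Hypothetical local signature store: SHA-256 digests of known-bad files.
KNOWN_BAD = {"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"}

def quick_local_scan(data: bytes) -> Verdict:
    """Fast on-device check: a hash lookup with a minimal footprint."""
    if hashlib.sha256(data).hexdigest() in KNOWN_BAD:
        return Verdict.MALWARE
    # A real engine would also run local heuristics and emulation here.
    return Verdict.UNKNOWN

def classify(data: bytes, cloud) -> Verdict:
    """Escalate only the files the fast detectors cannot settle."""
    verdict = quick_local_scan(data)
    if verdict is Verdict.UNKNOWN:
        # Deep static and dynamic analysis runs server-side, where the
        # compute power and the full body of telemetry are available.
        verdict = cloud.deep_analyze(data)
    return verdict
```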

However, even with the highest level of accuracy, there are times when we identify a file as dangerous and block it even though the user knows, or believes, otherwise. There is a “grey zone” of files that can occasionally do dangerous things without being designed to: either newly created, innocuous files that happen to resemble known malware, or files that use common malicious techniques for benign purposes. Understandably, a user may protest from time to time when their antivirus disables an application by blocking one of its files. In these situations (however uncommon), the AV might be in error, but it is also possible that the user is mistaken. Either way, we give users the option to contest the blocking action our antivirus has applied, using the dialog shown in Figure 1.

Figure 1: The dialog through which users can submit a blocked file to us for analysis.

User-submitted reports are gathered into a queue, and each contested decision is re-evaluated by a dedicated automated analysis pipeline. This kind of focused file analysis can be more thorough than the checks a typical antivirus runs automatically, but it can also be laborious and require the involvement of a human analyst.

The pipeline itself is made up of numerous expert systems. These include both publicly available tools, such as a sandbox that provides static and dynamic analysis of a given binary, and our own custom tooling. As you might expect, all of these systems produce rich structured data in addition to a binary classification. As seen in Figure 2, this can include file descriptors, certificate information, or the decomposition of the classification verdict into its contributing signals.

Figure 2: An example of the structured output of one of our many systems. Other systems provide additional data such as timestamps, certificates, and behavioral patterns.
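As a rough picture of what such structured output might look like, consider the nested record below. It is a hypothetical, simplified example; all field names and values are invented for illustration and do not reflect any specific Avast system.

```python
# A hypothetical, simplified structured verdict from one expert system.
report = {
    "sha256": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    "classification": {
        "label": "malware",
        "confidence": 0.83,
        # Decomposition of the verdict into contributing signals:
        "components": [
            {"signal": "packed_executable", "weight": 0.4},
            {"signal": "suspicious_imports", "weight": 0.3},
        ],
    },
    "certificate": {"issuer": "Example CA", "valid": False},
    "file": {"size": 182272, "type": "PE32", "entry_point": "0x4012f0"},
}
```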

The last stage of the pipeline is a sophisticated decision function that combines the outputs of all the separate systems into a final verdict. In some complex cases, the available information is not conclusive and the binary cannot be resolved automatically; such binaries land in a manual queue to be reviewed by one of our analysts.
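A minimal sketch of a decision function with such an abstention path might look as follows. The report shape, the naive averaging, and the threshold are all assumptions made for illustration; the production function is far more involved.

```python
from typing import Dict, List

def decide(reports: List[Dict], threshold: float = 0.9) -> str:
    """Combine per-system verdicts; return 'clean', 'malware', or 'manual'."""
    # Signed confidence: positive pulls toward malware, negative toward clean.
    scores = [
        r["classification"]["confidence"]
        * (1 if r["classification"]["label"] == "malware" else -1)
        for r in reports
    ]
    combined = sum(scores) / len(scores)  # naive average, for illustration only
    if combined >= threshold:
        return "malware"
    if combined <= -threshold:
        return "clean"
    return "manual"  # inconclusive evidence: enqueue for an analyst
```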

In the past, the decision function was a straightforward weighted majority vote. It was then gradually refined to handle increasingly complex situations, such as particular malware strains. As is customary, the number of systems in the pipeline grew dramatically after deployment, and the final decision function grew more complex with it, until manually curating its rules was no longer effective. By then, the pipeline had been running for a considerable time and had accumulated a respectable body of past judgments and binaries, so we could go one step further and learn the decision function from the data, something that had not been possible before.
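Conceptually, learning the decision function amounts to training a meta-classifier on the accumulated per-system verdicts and their eventual resolutions. The toy sketch below uses scikit-learn purely as a stand-in for Avast’s own platform, with invented numbers: each row holds one signed confidence per expert system, and the label is the final resolution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per historical binary, one signed confidence per expert system.
# y: the eventual ground-truth resolution (1 = malware, 0 = clean).
X = np.array([[ 0.9,  0.7, -0.2],
              [-0.8, -0.6, -0.9],
              [ 0.3, -0.1,  0.4]])
y = np.array([1, 0, 1])

meta = LogisticRegression().fit(X, y)

# The learned weights play the role the manually curated vote once did.
print(meta.coef_)
print(meta.predict_proba([[0.5, 0.2, -0.3]]))
```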

As previously mentioned in a blog post, Avast is building a cutting-edge machine learning platform aimed at automating data-processing pipelines. The platform makes it possible to model loosely structured data. More precisely, it can infer a schema for nested, tree-shaped data, numerize it with implicit feature engineering, and train neural networks directly on it.

Training a classifier on deeply nested, tree-shaped data like ours would typically require a great deal of manual feature engineering, and the resulting feature extractors would then have to be maintained over time. Our machine learning platform streamlines this considerably, because it folds schema inference, numerization, and implicit feature engineering into neural network training itself.
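To see what is being automated, here is the kind of hand-rolled flattener one would otherwise have to write, and keep maintaining, for tree-shaped reports. The platform makes this manual step unnecessary by inferring the schema and numerizing nested parts during training; the snippet is only an illustration of the work it replaces.

```python
def flatten(tree, prefix=""):
    """Recursively flatten nested dicts/lists into path -> value pairs."""
    items = {}
    if isinstance(tree, dict):
        for key, value in tree.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(tree, list):
        for i, value in enumerate(tree):
            items.update(flatten(value, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = tree
    return items

print(flatten({"cert": {"valid": False}, "signals": [{"w": 0.4}]}))
# -> {'cert.valid': False, 'signals.0.w': 0.4}
```

Every new field or extra level of nesting would require touching code like this; with schema inference, it does not.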

All of this makes the platform an excellent fit for our particular use case: training a meta-classifier on top of a collection of rapidly changing expert systems. The learned decision function has already been running in production for a while, and the input data shifted significantly during that period. Because the platform underpins the function, we do not have to worry as much about abrupt changes in the structure of the input data.

We can now confidently state that deploying the new decision function has markedly improved the pipeline’s overall performance. Our goal is to resolve as many potential false-positive reports as possible, promptly and with high confidence. Figure 3 shows the daily percentage of reports we could resolve immediately, without adding them to the manual queue. Note the transition point in the graph on July 7, 2021: with the decision function in place, the share of automatically handled reports nearly doubled, from 38% to 74%.

Figure 3: The ratio of automatically resolved investigations of reported binaries over time. Note the transition point on July 7, 2021: it coincides with the deployment of the automated decision maker and shows a clear improvement in throughput. The gain comes from the machine learning model’s ability to extract subtler patterns from the data than a manually designed system could.

But that’s not all; the machine learning platform brings further benefits. As detailed in our previous blog post on explainability in machine learning, we have built a framework on top of Mill.jl that helps us understand the individual decisions made by each of our models. It does so by applying a game-theoretic technique to estimate the cumulative relevance of each individual input feature in our dataset. In a future post, we will demonstrate how we use this framework to enrich the user interface of a decision support system that gives analysts access to the outputs of the individual systems described above.
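The canonical game-theoretic attribution of this kind is the Shapley value: a feature’s relevance is its average marginal contribution to the model’s output over random orders in which features are revealed. Whether or not that is the exact technique used here, the following minimal Monte Carlo sketch illustrates the idea; the `model` callable (scoring a dict of features) and the `baseline` of masked values are hypothetical.

```python
import random

def shapley_estimate(model, x, baseline, n_samples=200):
    """Monte Carlo estimate of Shapley values for a dict of features."""
    features = list(x)
    phi = {f: 0.0 for f in features}
    for _ in range(n_samples):
        order = random.sample(features, len(features))  # random permutation
        current = dict(baseline)   # start with every feature masked
        prev = model(current)
        for f in order:            # reveal features one at a time
            current[f] = x[f]
            now = model(current)
            phi[f] += (now - prev) / n_samples
            prev = now
    return phi
```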

In summary, Avast receives a constant stream of potential false positives from its users and must respond to them quickly. To do so, we run a dedicated pipeline that analyzes each reported binary and automatically determines whether it is clean. In some cases, the final decision must still be made by analysts. By learning the decision function from the data, we replicated the behavior of the previous, hand-curated function while extracting additional patterns from the data, which lets us automatically resolve 1.8x as many samples. Since those decisions no longer sit in a manual queue waiting for an analyst, the average reaction time has dropped considerably as well.
