8. Detection Techniques
Malware detection techniques fall broadly into three categories:
i) Signature-based detection
ii) Behaviour-based detection
iii) Anomaly-based detection
Signature-based detection uses its characterization of what is known to be malicious to decide the maliciousness of a program under inspection. As one may imagine, this characterization, or signature, of the malicious behaviour is the key to a signature-based detection method’s effectiveness.
Behaviour-based detection techniques focus on analyzing the behaviour of known and suspected malicious code. Such behaviours include factors such as the source and destination addresses of the malware, the attachment types in which they are embedded, and statistical anomalies in malware infected systems. One example of a behaviour-based detection approach is the histogram-based malicious code detection technology patented by Symantec.
An anomaly-based detection technique uses its knowledge of what constitutes normal behaviour to decide the maliciousness of a program under inspection. A special type of anomaly-based detection is referred to as specification-based detection.
Specification-based techniques leverage some specification or rule set of what is valid behaviour in order to decide the maliciousness of a program under inspection. Programs violating the specification are considered anomalous and, usually, malicious.
Each of the detection techniques can employ one of three different approaches:
1. A static approach uses syntactic or structural properties of the program under inspection (PUI) to determine its maliciousness. For example, a static approach to signature-based detection would only leverage structural information (e.g. sequence of bytes) to determine the maliciousness. In general, a static approach attempts to detect malware before the program under inspection executes.
2. A dynamic approach, on the other hand, leverages runtime information (e.g. systems seen on the runtime stack) of the PUI, which in this case is a running process. A dynamic approach attempts to detect malicious behaviour during program execution or after program execution.
3. A hybrid approach combines the static and dynamic approaches.
8.1 Anomaly-based Detection
Anomaly-based detection usually occurs in two phases: a training (learning) phase and a detection (monitoring) phase. During the training phase the detector attempts to learn the normal behaviour. The detector could be learning the behaviour of the host or the PUI or a combination of both during the training phase. A key advantage of anomaly-based detection is its ability to detect zero-day attacks. Weaver et al. describe zero-day exploits. Similar to zero-day exploits, zero-day attacks are attacks that are previously unknown to the malware detector. The two fundamental limitations of this technique are its high false alarm rate and the complexity involved in determining what features should be learned in the training phase.
Li et al. describe Fileprint (n-gram) analysis as a means for detecting malware. During the training phase, a model or set of models is derived that attempts to characterize the various file types on a system based on their structural (byte) composition. These models are derived from learning the file types the system intends to handle. The authors’ premise is that benign files have predictable, regular byte compositions for their respective types. So, for instance, benign .pdf files have a unique byte distribution that is different from .exe or .doc files. Any file under inspection that is deemed to vary “too greatly” from the given model or set of models is marked as suspicious. These suspicious files are marked for further inspection by some other mechanism or decider to determine whether they are actually malicious.
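The idea can be illustrated with a minimal sketch, assuming a 1-gram (single byte) model and a plain Manhattan distance. The function names, the averaging of training samples, and the threshold value are assumptions made for illustration and are not taken from Li et al.

```python
from collections import Counter


def byte_distribution(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 byte values (a 1-gram model)."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def train_model(samples: list[bytes]) -> list[float]:
    """Average the byte distributions of benign files of a single type."""
    dists = [byte_distribution(sample) for sample in samples]
    return [sum(column) / len(dists) for column in zip(*dists)]


def manhattan_distance(p: list[float], q: list[float]) -> float:
    """Sum of absolute per-byte frequency differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_suspicious(data: bytes, model: list[float], threshold: float = 0.5) -> bool:
    """Flag a file whose distribution varies 'too greatly' from the model.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return manhattan_distance(byte_distribution(data), model) > threshold
```

A file flagged by `is_suspicious` would then be handed to another decider, as described above, rather than being declared malicious outright.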
Short Sequences of System Calls
Hofmeyr et al. propose a technique that monitors system call sequences in order to detect maliciousness. First, profiles must be developed that represent the normal behavior of the system’s services. “Normal” in this technique is defined in terms of short sequences of system calls. Although intrusions may be based on other parameters, these other parameters are ignored. Hamming distance is used to determine how closely a system call sequence resembles another. A threshold must be set to determine whether a process is anomalous. Typically, processes showing large Hamming distance values are anomalous. Hofmeyr et al.’s method was able to find intrusions that attempted to exploit various UNIX programs like sendmail, lpr, and ftpd.
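A minimal sketch of this scheme follows. It assumes that the profile is the set of all length-k windows seen in normal traces and that a trace is scored by the largest minimum Hamming distance of any of its windows to the profile. The window length, the threshold, and all names are illustrative assumptions.

```python
def windows(trace: list[str], k: int) -> list[tuple[str, ...]]:
    """All contiguous system call sequences of length k in a trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_profile(normal_traces: list[list[str]], k: int) -> set[tuple[str, ...]]:
    """Profile of normal behaviour: every short sequence seen in training."""
    profile = set()
    for trace in normal_traces:
        profile.update(windows(trace, k))
    return profile


def hamming(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def anomaly_score(trace: list[str], profile: set[tuple[str, ...]], k: int) -> int:
    """Largest distance from any window of the trace to its closest normal window."""
    return max(
        (min(hamming(w, p) for p in profile) for w in windows(trace, k)),
        default=0,  # a trace shorter than k has no windows to score
    )


def is_anomalous(trace, profile, k: int = 3, threshold: int = 1) -> bool:
    """Flag a trace whose score exceeds a (hypothetical) threshold."""
    return anomaly_score(trace, profile, k) > threshold
```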
FSA for Detecting Anomalous Programs
Sekar et al. created a Finite State Automaton (FSA) based approach to anomaly detection. Each node in the FSA represents a state (program counter) in the PUI, which the algorithm utilizes to learn normal data faster and perform better detection. Transitions in the FSA are given by system calls. In order to construct the FSA, the program is executed multiple times. When a system call is invoked, a new transition is added to the automaton. The automaton resulting from the multiple executions is what the algorithm considers normal. During runtime, system calls are intercepted and the program’s state is recorded. If an error occurs in doing this, then an anomaly has occurred. Next, the algorithm checks for a valid transition from the FSA’s current state to the newly invoked system call. If no such transition exists, then there is an anomaly. If the previous two steps were successful, then the FSA is transitioned to the next state.
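The learning and checking steps can be sketched as follows, assuming each intercepted event is available as a (program counter, system call) pair and that a transition is stored as a (previous state, system call, new state) triple. This encoding and the names used are assumptions for illustration, not the authors' implementation.

```python
START = "start"  # hypothetical initial state before any system call is made


def learn_fsa(runs):
    """Build the set of normal transitions from several training executions.

    Each run is a list of (program_counter, syscall_name) events.
    """
    transitions = set()
    for run in runs:
        state = START
        for pc, syscall in run:
            transitions.add((state, syscall, pc))
            state = pc
    return transitions


def check_run(fsa, run):
    """Return the index of the first anomalous event, or None if the run is normal."""
    state = START
    for index, (pc, syscall) in enumerate(run):
        if (state, syscall, pc) not in fsa:
            return index  # no valid transition exists: anomaly
        state = pc  # valid transition: advance the automaton
    return None
```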
Wang et al. propose a method for detecting a type of malware they refer to as “ghostware.” Ghostware is malware that attempts to hide its existence from the Operating System’s querying utilities. This is typically done by intercepting the results of these queries and modifying them so that traces of the ghostware cannot be found via API queries. For example, if a user issues the “dir” command to list the files in the current directory, the ghostware would remove any of its resources from the results returned by “dir.”
Wang et al. offer a “cross-view diff-based” approach to detecting these types of malware. In addition to this approach they offer two ways of scanning for the malware, one being an inside-the-box approach and the other an outside-the-box approach. Since there are many layers that return values, and actual arguments must pass through them when a system call is made, many opportunities are afforded to ghostware to intercept function calls. The authors’ proposed method to counter this vulnerability will compare the results from a high-level system call like “dir” to a low-level access of the same data without using a system call. An example of a low-level access may be accessing the Master File Table (MFT) directly. This described process is considered the “cross-view diff-based” approach.
The inside-the-box approach mandates that the comparison of the high-level and low-level results is within the same machine. However, one may imagine that the ghostware may compromise the entire Operating System in which case the low-level scan cannot be trusted. Hence, an alternative to the inside-the-box approach is the outside-the-box approach.
In the outside-the-box approach, another (clean) host performs the low-level access without the target host’s knowledge. The high-level scan of the target host is compared to the low-level scan from the clean host. If there is any difference between the low-level and high-level scans, in either the inside-the-box or the outside-the-box approach, then ghostware is presumed to be present on the target host.
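The comparison at the heart of the cross-view diff can be sketched in a few lines. Here the two views are simply sets of file names, and the hooked listing is simulated by a hypothetical `hooked_listing` function; a real low-level scan (for example, reading the MFT) is outside the scope of this sketch.

```python
def hooked_listing(true_files: set[str], hidden_prefix: str) -> set[str]:
    """Simulate a high-level API whose results have been filtered by ghostware."""
    return {name for name in true_files if not name.startswith(hidden_prefix)}


def cross_view_diff(high_level: set[str], low_level: set[str]) -> set[str]:
    """Entries visible in the low-level view but missing from the high-level view."""
    return low_level - high_level


# Example: the low-level view sees everything on disk, the high-level view does not.
low_view = {"a.txt", "b.txt", "ghost.sys"}
high_view = hooked_listing(low_view, hidden_prefix="ghost")
hidden = cross_view_diff(high_view, low_view)
```

In the outside-the-box variant, `low_view` would be produced by a clean host rather than by the possibly compromised machine.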
In detecting file-hiding ghostware (e.g. ProBot SE and Aphex), the inside-the-box approach did not produce any false positives. However, the outside-the-box approach did produce some false positives for the 10 ghostware the authors used in their experiments. For registry-hiding ghostware (e.g. Hacker Defender 1.0 and Vanquish), virtually no valid false positives were found. The one false positive found, over the six ghostware used in this experiment, was fixed with a minor change. For process/module hiding ghostware (e.g. Berbew and FU), no false positives were found for the four ghostware used in the experiment. The authors do acknowledge that it is possible for false positives to occur for this type of ghostware, but did not see any during experimentation.
8.2 Specification-based Detection
Specification-based detection is a type of anomaly-based detection that tries to address the typical high false alarm rate associated with most anomaly-based detection techniques. Instead of attempting to approximate the implementation of an application or system, specification-based detection attempts to approximate the requirements for an application or system. In specification-based detection, the training phase is the attainment of some rule set, which specifies all the valid behaviour any program can exhibit for the system being protected or the program under inspection. The main limitation of specification-based detection is that it is often difficult to specify completely and accurately the entire set of valid behaviours a system should exhibit. One can imagine that even for a moderately complex system, the complete and accurate specification of its valid behaviours can be intractable. Even when it is straightforward to express specifications for a system in natural language, it is often difficult to express them in a form amenable to a machine.
Monitoring Security-Critical Programs
Ko et al. proposed a specification-based method for detecting maliciousness in a distributed environment. An implementor would specify the trace policy for the system. A trace is simply an ordered sequence of execution events, which are essentially the system calls recorded by the auditing mechanism. Ko et al. created a parallel environment (PE) grammar to address synchronization issues in distributed programs. Audit trails are parsed in real time.
The authors developed an implementation of their approach called Distributed Program Execution Monitor (DPEM). Trace policies were created for 15 Unix programs. One trace policy was created for the rdist program. DPEM was able to catch two violations present in an attack on the rdist program. The violation signal occurred approximately 0.06 seconds after the actual violation. The authors observed a similar delay when they created a trace policy involving passwd and vi. Intrusion attempts on sendmail and binmail were also detected by DPEM within 0.1 seconds.
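To make the notion of checking a trace against a policy concrete, here is a deliberately simplified sketch. It does not reproduce the parallel environment grammar of Ko et al.; instead it assumes a policy that maps each program to the system calls it may make and the path prefixes each call may touch. The policy contents, the `Event` fields, and the program name `exampled` are all hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One audited execution event (fields are illustrative assumptions)."""
    program: str
    syscall: str
    path: str


# Hypothetical policy: program -> syscall -> allowed path prefixes.
POLICY = {
    "exampled": {
        "open": ("/var/exampled/", "/etc/exampled.conf"),
        "write": ("/var/exampled/",),
    },
}


def violations(trail: list[Event], policy=POLICY) -> list[Event]:
    """Return every event in the audit trail that the policy does not permit."""
    bad = []
    for event in trail:
        allowed = policy.get(event.program, {}).get(event.syscall)
        if allowed is None or not event.path.startswith(allowed):
            bad.append(event)
    return bad
```

Because the trail is processed event by event, the same loop could be driven by a live audit stream to approximate real-time parsing.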
Automated Detection of Vulnerabilities in Privilege Programs
Ko et al. present a specification language for specifying the intended behaviour of privileged programs. This technique relies on the Operating System to generate audit trails that are then used in the validation of program behavior. Auditing is the process of logging “interesting” activities, which in this case means whenever a system call is invoked. Based on a program’s specification, its runtime behaviour can be decided to be malicious or not. The program’s specification is translated into audit trails that are compared to the audit trails of the PUI. The audit trails of the PUI are captured by the Operating System.
A potential disadvantage of this technique is that the malware would be detected after the attack. Another potential disadvantage is that the technique can only be as granular as the Operating System’s auditing mechanism. No empirical study of the effectiveness of this approach was given.
Rabek et al. offer a technique called DOME (Detection Of Malicious Executables). DOME was designed to detect injected, dynamically generated, and obfuscated code. DOME is characterized by two steps. In the first step, DOME statically preprocesses the PUI. Preprocessing consists of saving (1) the addresses of system calls, (2) their names, and (3) the address of the instruction directly following each system call, that is, the return address of each system call in the executable.
In the second step, DOME monitors the executable at runtime, ensuring that all system calls made at runtime match those recorded from the static analysis performed in the first step. The API is instrumented with pre-stub and optionally post-stub code at load time. The pre-stub code ensures that items 1 – 3 from the preprocessing stage match what is seen at execution time. In their proof-of-concept study, Rabek et al. found that DOME was able to detect all system calls made by the malicious code.
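The two steps can be sketched as a lookup against a statically collected table. The `CallSite` record and the function names below are assumptions for illustration; the real system works on binaries and instrumented API stubs rather than Python objects.

```python
from typing import NamedTuple


class CallSite(NamedTuple):
    """The three items saved during static preprocessing."""
    call_addr: int    # (1) address of the system call instruction
    name: str         # (2) name of the system call
    return_addr: int  # (3) address of the instruction following the call


def pre_stub_check(expected: set[CallSite], observed: CallSite) -> bool:
    """Pre-stub logic: the runtime call must match a statically recorded site."""
    return observed in expected


def monitor(expected: set[CallSite], runtime_calls: list[CallSite]) -> list[CallSite]:
    """Return the runtime calls that were not found by static preprocessing."""
    return [call for call in runtime_calls if not pre_stub_check(expected, call)]
```

A call issued from injected or dynamically generated code would have no matching entry, so it appears in the list returned by `monitor`.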
SPiKE is a framework designed to help users monitor the behavior of applications, with the goal of finding malicious behavior. The monitoring is done by instrumenting Operating System services (system calls). The authors claim that “most if not all malware are sensitive to code modification.” If this statement is true, then in order for binary instrumentation to be useful, it must be stealthier with respect to malware. For example, instrumentation introduces abnormal latency for a system call. Malware may query the real-time clock and ascertain that it is being monitored because system calls are taking too long to complete. SPiKE combats this, or hides itself, by applying a clock patch such that any request returns a time closer to what the malware would expect (i.e. the real-time clock reflects a time consistent with what it would have shown had normal execution of the system call taken place). SPiKE allows for instrumentation anywhere in an executable with the use of “drifters.” A drifter can be described by its two components: a code-breakpoint and an instrument. Code-breakpoints are implemented by setting the “not-present” attribute of the virtual page containing the memory location to be instrumented. Once the page fault exception is raised, an opportunity for stealth instrumentation presents itself. The instrument component of a drifter is simply the monitoring (or whatever other functionality) the user desires to perform.
In the assessment of Vasudevan and Yerraballi, SPiKE appeared to successfully track the W32.MyDoom Trojan, which is an intelligent malware instance that can identify traditional binary instrumentation.
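The clock patch idea can be illustrated with a small sketch, assuming the monitor can measure how long each instrument runs and subtract the accumulated overhead from every time query. The `PatchedClock` class and its methods are hypothetical and only model the bookkeeping; they say nothing about how SPiKE intercepts clock reads.

```python
import time
from typing import Callable


class PatchedClock:
    """Report time with accumulated instrumentation latency removed."""

    def __init__(self, real_clock: Callable[[], float] = time.monotonic):
        self._real_clock = real_clock
        self._hidden = 0.0  # total time spent inside instruments

    def run_instrument(self, instrument: Callable[[], None]) -> None:
        """Run a monitoring routine and record how long it took."""
        start = self._real_clock()
        instrument()
        self._hidden += self._real_clock() - start

    def now(self) -> float:
        """The time the monitored program is allowed to observe."""
        return self._real_clock() - self._hidden
```

A program that times its own system calls through `now()` would see only the latency of the call itself, not that of the instrument.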
8.3 Behaviour-based Detection
For antivirus scanners to detect malware before it has been studied, they must perform some sort of automatic analysis themselves. One solution to this is behaviour-based detection. As the name suggests, this technique monitors the runtime behaviour of the program because, no matter the disguise, a piece of malware will behave badly; that is its purpose. One technique for analysing the behaviour of a program is to study the sequence of operating system calls it makes. Antivirus software can intercept these API calls while a program is running and use heuristics to look for suspicious activity, terminating programs that show harmful behaviour. Various heuristics have been researched, such as looking for patterns used for self-replication, but these all rely on monitoring a program once it is running. This is dangerous to rely on because the malware might cause harm to the system before it is recognized as malicious.
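As an illustration of such a heuristic, the sketch below scans an intercepted call trace for one simple self-replication pattern: a program reads its own executable image and later writes another executable file. The event labels ("read", "write"), the trace format, and the rule itself are assumptions chosen for clarity, not a heuristic taken from a particular scanner.

```python
def looks_self_replicating(own_path: str, calls: list[tuple[str, str]]) -> bool:
    """Heuristic: the program reads its own image, then writes a different .exe.

    Each call is an (operation, path) pair taken from an intercepted trace.
    """
    has_read_self = False
    for operation, path in calls:
        if operation == "read" and path == own_path:
            has_read_self = True
        elif (
            operation == "write"
            and has_read_self
            and path != own_path
            and path.lower().endswith(".exe")
        ):
            return True
    return False
```

A monitor applying this rule at runtime can only react after the suspicious write has been attempted, which is the weakness noted above.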
Alternatively, we can adapt this approach to scan a program by observing its execution in a virtual environment. This is done through “Dynamic Binary Translation” (DBT), which translates and caches binary code, replacing API calls so that they modify virtual resources rather than the real system. This gives fast execution, safely isolates the program, and allows intimate observation of its activities.
CWSandbox has been developed as a tool for AV analysts and uses a similar method. It does use a virtual machine using DBT, but the analyser software itself is also executed in the virtualized environment and uses inline code overwriting to hook the API functions, which could allow malware to detect the analysis and change its behaviour to avoid detection.
8.4 Signature-based Detection
Signature-based detection attempts to model the malicious behavior of malware and uses this model in the detection of malware. The collection of all of these models represents signature-based detection’s knowledge. This model of malicious behavior is often referred to as the signature.
Ideally, a signature should be able to identify any malware exhibiting the malicious behavior specified by the signature. Like any data that exists in large quantities which requires storage, signatures require a repository. This repository represents all of the knowledge the signature-based method has, as it pertains to malware detection. The repository is searched when the method attempts to assess whether the PUI contains a known signature.
Currently, we primarily rely on human expertise in creating the signatures that represent the malicious behavior exhibited by programs. Once a signature has been created, it is added to the signature-based method’s knowledge (i.e. repository). One of the major drawbacks of the signature-based method for malware detection is that it cannot detect zero-day attacks, i.e. an attack for which there is no corresponding signature stored in the repository.
Generic Virus Scanner
Kumar and Spafford proposed a general scanner which detected viruses based on regular expression matching. At each nibble in an input stream (e.g. a file) being scanned, the pattern matching algorithm compares all known viruses matching this nibble value to see if the input stream sequence matches a known virus. Because the scanner proposed was made exclusively for SunOS, the authors’ implementation was, consequently, not easily comparable to other virus scanners of different Operating Systems and file systems.
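A minimal sketch of regular-expression-based scanning is shown below, using Python's `re` module on byte strings. It matches at byte rather than nibble granularity, and the two signatures are made-up patterns for illustration; neither corresponds to a real virus nor to the authors' scanner.

```python
import re

# Hypothetical signature repository: name -> compiled byte-level pattern.
SIGNATURES = {
    # Two marker bytes, up to four arbitrary bytes, then two more marker bytes.
    "demo-a": re.compile(rb"\xde\xad.{0,4}\xbe\xef", re.DOTALL),
    # An ASCII marker followed by exactly two digits.
    "demo-b": re.compile(rb"EVIL[0-9]{2}"),
}


def scan(data: bytes, signatures=SIGNATURES) -> list[str]:
    """Return the names of all signatures that match anywhere in the input."""
    return sorted(
        name for name, pattern in signatures.items() if pattern.search(data)
    )
```

Regular expressions let one signature cover variants that differ in a few bytes, which a fixed byte string could not do.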
Kreibich and Crowcroft proposed honeycomb, a system that uses honeypots to generate signatures and detect malware stemming from network traffic. The authors’ technique operates under the assumption that traffic directed to a honeypot is suspicious. Honeycomb stores information about each connection, including the reassembled stream, even after the connection has been terminated. The number of connections it can save is limited. The Longest Common Substring (LCS) algorithm is used to determine whether a match is found between stored connections and new connections that honeycomb receives. Since honeycomb needs a set of signatures to compare to, it initially uses anomalies in the connection stream to create signatures. For example, if odd TCP flags are found, then a signature will be generated for that stream. The signature is the stream that came into honeycomb, excluding honeycomb’s responses to the incoming stream. To detect malicious code, horizontal and vertical detection schemes are used. In the horizontal approach, the last (nth) message of an incoming stream is compared to the nth message of all streams stored by honeycomb. In the vertical approach, the messages are aggregated, and the LCS algorithm is run on the aggregation of the newly arrived stream as well as the aggregated form of the streams stored by honeycomb. Signatures that are rarely used are deleted from the queue of signatures. Also, if a new signature is found to be the same as, or a subset of, an already existing signature, it is not added to the signature pool. This helps keep the signature pool as small as possible. At regular intervals, the signatures found are reported and logged to another module. In Kreibich and Crowcroft’s empirical study, they were able to develop “precise” signatures for the Slammer and CodeRed II worms.
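The pattern-matching step can be sketched as follows, taking LCS to mean the longest common substring of two byte streams (an assumption about the variant) and using a simple quadratic dynamic-programming version. The `propose_signature` helper and its minimum length are hypothetical; the sketch covers neither connection tracking nor the horizontal and vertical schemes themselves.

```python
from typing import Optional


def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Longest contiguous byte run shared by a and b (O(len(a) * len(b)))."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at i-1, j-1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def propose_signature(
    new_stream: bytes, stored_streams: list[bytes], min_len: int = 4
) -> Optional[bytes]:
    """Longest substring shared with any stored stream, if it is long enough."""
    candidates = [longest_common_substring(new_stream, s) for s in stored_streams]
    best = max(candidates, key=len, default=b"")
    return best if len(best) >= min_len else None
```

The minimum length guards against trivially short matches, which would otherwise yield signatures that match almost any traffic.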
 Tristan Aubrey-Jones, Behaviour Based Malware Detection, 2007