Checking a Word file for viruses: an antivirus-style check specific to Word files

http://ifx0.com/word_forensics.html 

Check entropy as well:

http://ifx0.com/context_aware_auditor.html

How it works

The HTML file you are building is a client-side static analysis tool. This means all the “detective work” happens inside your web browser using JavaScript, without sending the file to a server.

Here is an explanation of how it works and how it identifies potential threats:

1. How it works: The “Unzipping” Approach

Modern Word files (.docx) are not monolithic binary files; each one is a ZIP archive. If you rename a .docx file to .zip and open it, you will see a collection of XML files and folders.

  • JSZip Library: Your code uses the JSZip library to open these files in the browser’s memory without actually extracting them to your hard drive.

  • Structure Traversal: Once “unzipped” in memory, the script loops through the internal file structure (fileEntries) to look for specific files and patterns known to be used by malicious actors.
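
The traversal step can be sketched roughly as follows. This is a minimal sketch, not the tool's actual source: listFileEntries is a hypothetical helper, and it assumes the archive has already been loaded with JSZip.loadAsync, which resolves to an object whose files property maps each entry path to an entry with name and dir fields. A stand-in object with that shape is used here so the example runs without JSZip.

```javascript
// Hypothetical helper: collect the paths of all non-directory entries
// from a loaded JSZip instance (or any object with the same `files` shape).
function listFileEntries(zip) {
  return Object.values(zip.files)
    .filter((entry) => !entry.dir) // skip folder placeholders
    .map((entry) => entry.name)
    .sort();
}

// In the browser, the input would typically come from a file picker:
//   const zip = await JSZip.loadAsync(await file.arrayBuffer());
//   const fileEntries = listFileEntries(zip);

// Stand-in object (an assumption for illustration) with the same shape.
const fakeZip = {
  files: {
    "word/": { name: "word/", dir: true },
    "word/document.xml": { name: "word/document.xml", dir: false },
    "[Content_Types].xml": { name: "[Content_Types].xml", dir: false },
  },
};

const fileEntries = listFileEntries(fakeZip);
console.log(fileEntries);
```

The resulting list of paths is what the later checks would inspect.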

2. How it detects potential viruses

This tool does not match files against a database of known virus signatures, as an antivirus program does. Instead, it looks for indicators of suspicion (heuristics) that suggest a file might be weaponized:

A. Macro Detection (vbaProject.bin)

Macros are small programs embedded in Word documents to automate tasks. However, they are also one of the most common delivery methods for malware.

  • The check: The script looks for a file named vbaProject.bin inside the ZIP structure (in a Word file it normally sits at word/vbaProject.bin). If it finds this file, it flags the document as HIGH RISK because macros can execute arbitrary code on the system once the document is opened and macros are allowed to run.
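
A macro check along these lines might look like the sketch below. The function name findMacroProjects and the sample paths are assumptions for illustration; the input is assumed to be an array of entry paths taken from the archive.

```javascript
// Hypothetical helper: return the paths of any VBA project parts.
// The file name is compared case-insensitively and in any folder.
function findMacroProjects(fileEntries) {
  return fileEntries.filter((path) => {
    const baseName = path.split("/").pop().toLowerCase();
    return baseName === "vbaproject.bin";
  });
}

// Sample entry list (made up for this example).
const macroHits = findMacroProjects([
  "[Content_Types].xml",
  "word/document.xml",
  "word/vbaProject.bin",
]);

const macroVerdict = macroHits.length > 0 ? "HIGH RISK" : "no macro project found";
console.log(macroVerdict, macroHits);
```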

B. Embedded OLE Objects

OLE (Object Linking and Embedding) allows you to embed other files (like Excel sheets or images) inside a Word doc. Attackers use this to hide malicious binaries that trigger when the object is interacted with.

  • The check: The script searches the word/embeddings/ folder for .bin files. A high count or large files in this area are considered suspicious.
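
The embedded-object check could be sketched like this. The thresholds below are placeholders, since the tool's actual limits are not stated here, and checkEmbeddedObjects is a hypothetical name. Each entry is assumed to carry its uncompressed size, which with JSZip could be obtained by reading the entry as a byte array and taking its length.

```javascript
// Hypothetical thresholds; the original tool's limits are not stated.
const MAX_EMBEDDED_COUNT = 3;
const MAX_EMBEDDED_BYTES = 1024 * 1024; // 1 MiB

// `entries` is assumed to be an array of { path, size } objects, where
// `size` is the uncompressed byte length of the entry.
function checkEmbeddedObjects(entries) {
  const embedded = entries.filter(
    (e) =>
      e.path.startsWith("word/embeddings/") &&
      e.path.toLowerCase().endsWith(".bin")
  );
  const oversized = embedded.filter((e) => e.size > MAX_EMBEDDED_BYTES);
  return {
    count: embedded.length,
    oversized: oversized.map((e) => e.path),
    suspicious: embedded.length > MAX_EMBEDDED_COUNT || oversized.length > 0,
  };
}

// Sample entries (made up for this example).
const oleReport = checkEmbeddedObjects([
  { path: "word/document.xml", size: 4200 },
  { path: "word/embeddings/oleObject1.bin", size: 2 * 1024 * 1024 },
  { path: "word/media/image1.png", size: 90000 },
]);
console.log(oleReport);
```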

C. Size-to-Content Analysis (Heuristics)

Malware authors often hide malicious code by “padding” a document with invisible data.

  • The check: The tool extracts the visible text from word/document.xml and compares its length with the file size. If the file is 5 MB in size but contains only 50 characters of text, the ratio is highly suspicious. This discrepancy suggests the file is “hollow”: the bulk of the file is likely hidden code or data rather than actual document content.
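
One way to approximate this ratio is sketched below. It assumes the visible text lives in w:t elements of document.xml and pulls it out with a regular expression. A real XML parser would be more robust, and the 10,000 bytes-per-character threshold is an arbitrary placeholder rather than the tool's actual value. The function names are hypothetical.

```javascript
// Pull the text out of <w:t> runs. The pattern requires whitespace or ">"
// right after "<w:t", so tags such as <w:tab/> or <w:tbl> are not matched.
function extractVisibleText(documentXml) {
  const runPattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
  const pieces = [];
  for (const match of documentXml.matchAll(runPattern)) {
    pieces.push(match[1]);
  }
  return pieces.join("");
}

// Hypothetical threshold: flag files above this many bytes per character.
const MAX_BYTES_PER_CHAR = 10000;

function sizeToTextRatio(fileSizeBytes, documentXml) {
  const textLength = extractVisibleText(documentXml).length;
  // Treat an empty document as one character to avoid dividing by zero.
  const bytesPerChar = fileSizeBytes / Math.max(textLength, 1);
  return {
    textLength,
    bytesPerChar,
    suspicious: bytesPerChar > MAX_BYTES_PER_CHAR,
  };
}

// Sample markup (made up for this example): 16 visible characters.
const sampleXml =
  "<w:body><w:p><w:r><w:t>Invoice</w:t></w:r><w:r><w:tab/>" +
  '<w:t xml:space="preserve"> attached</w:t></w:r></w:p></w:body>';

const ratioReport = sizeToTextRatio(5 * 1024 * 1024, sampleXml);
console.log(ratioReport);
```

Note that XML entities such as &amp;amp; are counted as written, so the character count is only approximate.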

D. Trailing Byte Analysis

A well-formed ZIP file ends with the “End of Central Directory” record, which marks the end of the archive data.

  • The check: Your script locates this record. If there is data after the end of the record, someone has appended data to the end of the file. This is a common technique for hiding “payloads” that are not part of the document structure but are read and executed by an exploit.
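
A trailing-byte check could be sketched as below. It relies on the ZIP format's End of Central Directory layout: a 4-byte signature (the bytes 50 4B 05 06), a fixed 22-byte record, and a 2-byte little-endian comment length at offset 20. The function name countTrailingBytes is hypothetical, ZIP64 archives are not handled, and a payload that itself contains the signature bytes could confuse this simple backward scan.

```javascript
// Scan backwards for the EOCD signature, then count the bytes that follow
// the end of the record (including its optional comment).
// Returns -1 if no plausible EOCD record is found.
function countTrailingBytes(bytes) {
  const EOCD_MIN_SIZE = 22; // fixed part of the record, without the comment
  for (let i = bytes.length - EOCD_MIN_SIZE; i >= 0; i--) {
    const isSignature =
      bytes[i] === 0x50 && bytes[i + 1] === 0x4b &&
      bytes[i + 2] === 0x05 && bytes[i + 3] === 0x06;
    if (!isSignature) continue;
    // Comment length: little-endian uint16 at offset 20 of the record.
    const commentLength = bytes[i + 20] | (bytes[i + 21] << 8);
    const recordEnd = i + EOCD_MIN_SIZE + commentLength;
    // Ignore matches whose declared comment would run past the file end.
    if (recordEnd <= bytes.length) return bytes.length - recordEnd;
  }
  return -1;
}

// An empty archive is just the signature followed by zeroed fields.
const emptyZip = new Uint8Array(22);
emptyZip.set([0x50, 0x4b, 0x05, 0x06]);

// The same archive with five extra bytes appended after the record.
const paddedZip = new Uint8Array(22 + 5);
paddedZip.set(emptyZip);
paddedZip.set([1, 2, 3, 4, 5], 22);

console.log(countTrailingBytes(emptyZip), countTrailingBytes(paddedZip)); // prints: 0 5
```

In the browser, the byte array would typically come from new Uint8Array(await file.arrayBuffer()). Any result above zero would be reported as appended data.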

Limitations (The “Static” Reality)

It is important to remember what this tool cannot do:

  • It cannot execute the code: It only looks at the structure. It cannot see what the macro will do once it runs.

  • It is not a Sandbox: True malware analysis involves running the file in a “sandbox” (an isolated virtual machine) to see what network calls it makes and what files it tries to modify. This tool only looks for the “DNA” of the file.

Safety Note: Never open a document if this tool flags it as suspicious.
