Commit 4b5f6d40 authored by 127.0.0.1, committed by amnesia

Add section about chi-squared test

parent 8f3f9555
......@@ -532,3 +532,41 @@ User interface
### *VeraCrypt Mounter* (optional)
<img src="https://labs.riseup.net/code/attachments/download/1842/veracrypt-mounter.png">
Detecting VeraCrypt volumes
===========================
In contrast to LUKS, VeraCrypt and TrueCrypt volumes do not have a cleartext header, but are completely encrypted (see the [VeraCrypt Volume Format Specification][]). As a result, VeraCrypt/TrueCrypt volumes cannot be distinguished from random data. The best we can do is tell the user that a partition or file appears to contain encrypted or random data, and is therefore a candidate for being a VeraCrypt/TrueCrypt volume.
To determine whether data seems to be encrypted or random, we use [Pearson's chi-squared test][]. This test is often used to test for randomness.
When trying to determine whether a *partition* (or whole device) is a VeraCrypt/TrueCrypt volume, we don't want to read more than necessary, to avoid slowing things down too much. Because non-encrypted filesystems usually start with a header, which is very non-random, we only perform the chi-squared test on the first 512 bytes.
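
A minimal sketch of how the statistic could be computed for such a sample. It assumes that the test compares the byte histogram of the sample (256 possible byte values) against a uniform distribution, which matches the 255 degrees of freedom used in this section. The names `SAMPLE_SIZE`, `read_sample` and `chi_squared` are hypothetical and not taken from any existing implementation.

```python
from collections import Counter

SAMPLE_SIZE = 512  # bytes read from the start of the partition or file


def read_sample(path: str, size: int = SAMPLE_SIZE) -> bytes:
    """Read the first `size` bytes of a device or file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(size)


def chi_squared(data: bytes) -> float:
    """Pearson's chi-squared statistic of the byte histogram of `data`,
    compared against a uniform distribution over the 256 byte values."""
    if not data:
        raise ValueError("need at least one byte")
    expected = len(data) / 256  # 2.0 for a 512-byte sample
    counts = Counter(data)      # Counter returns 0 for byte values that never occur
    return sum((counts[b] - expected) ** 2 / expected for b in range(256))
```

With a 512-byte sample, each byte value is expected to occur twice, so a sample in which every value occurs exactly twice yields 0, while a strongly structured header yields a very large value.
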
The chi-squared test requires a significance level, i.e. a p-value threshold below which we reject the hypothesis that the data is random. We choose 1/10,000,000,000 for each tail of the distribution, which means that, in each direction, roughly one in 10 billion truly random samples will produce a false negative, i.e. will be reported as non-random/non-encrypted even though it actually is random/encrypted. Using the [scipy chi2 module][] with 255 degrees of freedom (256 possible byte values minus one), we get the following lower and upper limits for the chi-squared value:

    >>> from scipy.stats import chi2
    >>> chi2.ppf([0.1**10, 1-0.1**10], 255)
    array([ 136.49878495, 425.92327131])

We round these values down to whole numbers. So for chi-squared values between 136 and 425, we accept the hypothesis that the data is random/encrypted.
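
The acceptance rule can be sketched as a small predicate. The constant and function names are hypothetical, and treating both limits as inclusive is an assumption, since the text only says "between 136 and 425".

```python
# Limits taken from the text above; inclusive comparison is an assumption.
CHI2_LOWER = 136
CHI2_UPPER = 425


def looks_random(chi2_value: float) -> bool:
    """Return True if the chi-squared value lies in the accepted range,
    i.e. the sample is treated as a random/encrypted candidate."""
    return CHI2_LOWER <= chi2_value <= CHI2_UPPER
```

Note that values below the lower limit are rejected as well: a sample that is "too uniform" is as suspicious as one that is too skewed.
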
We will not be able to prevent false positives as effectively as false negatives. Since we treat all random-looking partitions as TrueCrypt/VeraCrypt candidates, we will definitely have false positives, because there are other use cases for random-looking partitions, for example plain dm-crypt, headerless LUKS, or loop-AES partitions. This cannot be avoided, so we have to clearly indicate to the user that such a partition is not necessarily a TrueCrypt/VeraCrypt partition, but only a candidate.
We don't expect false positives for unencrypted filesystems, because their chi-squared values fall far outside the accepted range. Some examples of chi-squared values for (more or less) common filesystems, calculated with the above method:
| Filesystem | Chi-squared |
|------------|-------------|
| bfs | 113013 |
| exfat | 115672 |
| ext2 | 130560 |
| ext3 | 130560 |
| ext4 | 130560 |
| fat | 56629 |
| minix | 130560 |
| ntfs | 61937 |
| vfat | 56651 |
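
The value 130560 that appears several times in the table is exactly what the statistic yields when all 512 sampled bytes have the same value (for instance an all-zero block), assuming 256 byte-value bins. A short check of that arithmetic, with hypothetical variable names:

```python
n = 512              # sample size in bytes
bins = 256           # possible byte values
expected = n / bins  # expected count per byte value: 2.0

# One byte value occurs n times, the other 255 values never occur.
# Each empty bin contributes (0 - expected)**2 / expected == expected.
chi2_constant_block = (n - expected) ** 2 / expected + (bins - 1) * expected

print(chi2_constant_block)  # 130560.0
```

This is the largest value the statistic can take for a 512-byte sample, which shows how far such headers are from the accepted range of 136 to 425.
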
[VeraCrypt Volume Format Specification]: https://veracrypt.codeplex.com/wikipage?title=VeraCrypt%20Volume%20Format%20Specification
[Pearson's chi-squared test]: https://en.wikipedia.org/wiki/Chi-squared_test
[scipy chi2 module]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html