In part 1 we looked at parsing bencoded data. Let’s put this to good use and start looking at .torrent
files - we’ll finish by creating our own (from scratch!).
We’re building up from the go-bt repository. Clone it if you want to follow along.
The .torrent
file
At a basic level the .torrent
file contains 2 types of information. The first is enough information (in the form of an announce
key) for the client to connect to a tracker and query for peers, and details on the content (primarily its length, the number of pieces and hashes).
Note: not all torrents rely on trackers. Trackerless torrents using something called DHT - and do not have an announce
key in the torrent dictionary - we may revisit this at some point but not right now
Let’s parse a sample torrent file for Ubuntu:
/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=./bencode/testdata/ubuntu.torrent 2>&1 | cut -c 1-80
{
"announce": "https://torrent.ubuntu.com/announce",
"announce-list": [
[
"https://torrent.ubuntu.com/announce"
],
[
"https://ipv6.torrent.ubuntu.com/announce"
]
],
"comment": "Ubuntu CD releases.ubuntu.com",
"created by": "mktorrent 1.1",
"creation date": 1677174459,
"info": {
"length": 1975971840,
"name": "ubuntu-22.04.2-live-server-amd64.iso",
"piece length": 262144,
"pieces": "*8\ufffd7R\ufffd\ufffdEُ\ufffdp܊\ufffd\ufffd3\ufffd\u00087MV\ufff
}
}
(output is truncated to 80 characters - the pieces
key contains the concatenated hash of each piece which gets well, big).
The announce
key tells us where the tracker lives (c.f. next section). We’ll connect to the tracker to source peers primarily. The other fields are self-explanatory and are primarily for the benefit of the user.
The info
map inside the top-level one contains details about the actual contents of the torrent - all the juicy bits are here. In the example above we have:
length
: the number of bytes in the filepiece length
: number of bytes in each piecepieces
: the SHA1 sum of each piece
To find the total number of pieces we divide length
by piece length
and round up to the next integer:
/V/r/g/src ❯❯❯ python3 -c 'import sys, math;print(math.ceil(1975971840/262144))'
7538
pieces
contains the 20-bytes SHA1 digest of each piece concatenated together - which should be 7538*20=150760
bytes. Sure enought, that’s exactly the length of the pieces
string in the torrent file (right at the end):
/V/r/g/src ❯❯❯ cat ./bencode/testdata/ubuntu.torrent | fold -w 80
d8:announce35:https://torrent.ubuntu.com/announce13:announce-listll35:https://to
rrent.ubuntu.com/announceel40:https://ipv6.torrent.ubuntu.com/announceee7:commen
t29:Ubuntu CD releases.ubuntu.com10:created by13:mktorrent 1.113:creation datei1
677174459e4:infod6:lengthi1975971840e4:name36:ubuntu-22.04.2-live-server-amd64.i
so12:piece lengthi262144e6:pieces150760:*8
(if you’re wondering why we’re not using jq
for this, it’s because in the JSON output the binary string has been encoded, presumably in a UTF variant - which also explains why fold
is essentially truncating the output because there must be a null byte somewhere in that giant string)
Things are a little different for torrents containing multiple files! Let’s take the .torrent
for QubeOS:
/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/Users/axiomiety/Downloads/Qubes-R4.2.2-x86_64.torrent | cut -c 1-80
{
"announce": "udp://tracker.torrent.eu.org:451",
"announce-list": [
[
"udp://tracker.torrent.eu.org:451"
],
[
"udp://tracker.opentrackr.org:1337/announce"
],
[
"https://tracker.gbitt.info:443/announce"
],
[
"http://tracker.gbitt.info:80/announce"
]
],
"created by": "mktorrent 1.1",
"info": {
"files": [
{
"length": 6897252352,
"path": [
"Qubes-R4.2.2-x86_64.iso"
]
},
{
"length": 1251,
"path": [
"Qubes-R4.2.2-x86_64.iso.DIGESTS"
]
},
{
"length": 833,
"path": [
"Qubes-R4.2.2-x86_64.iso.asc"
]
}
],
"name": "Qubes-R4.2.2-x86_64",
"piece length": 1048576,
"pieces": "ɱ\ufffd]aG\ufffd\ufffd\ufffd\ufffda\ufffd7\u003cru\ufffd.bK\ufff
},
"url-list": [
"https://mirrors.kernel.org/qubes/iso/",
"https://ftp.qubes-os.org/iso/"
]
}
The info
map still contains the familiar piece length
and pieces
keys, with the addition of a new files
list that contains both the path
and length
of each individual file.
A quick calculation shows we should have:
>>> math.ceil((6897252352+1251+833)/1048576)
6578
pieces, and we therefore expect pieces
to be of length 6578*20=131560
:
/V/r/g/src ❯❯❯ cat ~/Downloads/Qubes-R4.2.2-x86_64.torrent | fold -w 80
d8:announce32:udp://tracker.torrent.eu.org:45113:announce-listll32:udp://tracker
.torrent.eu.org:451el42:udp://tracker.opentrackr.org:1337/announceel39:https://t
racker.gbitt.info:443/announceel37:http://tracker.gbitt.info:80/announceee10:cre
ated by13:mktorrent 1.14:infod5:filesld6:lengthi6897252352e4:pathl23:Qubes-R4.2.
2-x86_64.isoeed6:lengthi1251e4:pathl31:Qubes-R4.2.2-x86_64.iso.DIGESTSeed6:lengt
hi833e4:pathl27:Qubes-R4.2.2-x86_64.iso.asceee4:name19:Qubes-R4.2.2-x86_6412:pie
ce lengthi1048576e6:pieces131560:ɱ
An interesting caveat in all this is that a single block can contain bytes belonging to more than one file! Take this example:
file1=120
file2=40
file3=30
Total size is 120+40+30=190
. If the block size was 100, block 2 would contain the last 20 bytes of file
, along with the bytes of file2
and file3
. It’s a tiny detail but it does emphasise how each block is important (we’ll revisit this when requesting blocks from peers, especially with respect to block sizes).
Creating our own .torrent
To make sure we understand how this works, here’s a short snippet that creates 3 files with pseudo-random data (I appreciate it isn’t in go
but heh):
This should allow you to re-create the same files on your machine - here are the SHA1 digests:
/t/files ❯❯❯ python3 gen-random-files.py
/t/files ❯❯❯ ls -lh file*
-rw-r--r--@ 1 axiomiety wheel 6.7M Aug 30 15:25 file1
-rw-r--r--@ 1 axiomiety wheel 1.9M Aug 30 15:25 file2
-rw-r--r--@ 1 axiomiety wheel 2.9M Aug 30 15:25 file3
/t/files ❯❯❯ shasum file*
758d2401caa0d71d71cffd84d8491c6b07a5cb5f file1
2035dbcd7c76b22f3112426ceebffe75117af26d file2
6149596f744de4098ec1d43dc1999cc4c32a40a0 file3
A quick back-of-the-envelope calculation shows that for the files above, using a block size of 64 KiB, we should have a total of 184 blocks - so 184 digests:
We can use split
to simulate each block, just to ensure we’re computing the hashes correctly. We’re particularly interested in ensuring we handle file and block boundaries correctly. For argument’s sake let’s print the digests of the first and last 3 blocks:
/t/files ❯❯❯ cat file* | split -b 65536
/t/files ❯❯❯ ls -1 x* | wc -l
184
/t/files ❯❯❯ ls
file1 xad xaj xap xav xbb xbh xbn xbt xbz xcf xcl xcr xcx xdd xdj xdp xdv xeb xeh xen xet xez xff xfl xfr xfx xgd xgj xgp xgv xhb
file2 xae xak xaq xaw xbc xbi xbo xbu xca xcg xcm xcs xcy xde xdk xdq xdw xec xei xeo xeu xfa xfg xfm xfs xfy xge xgk xgq xgw
file3 xaf xal xar xax xbd xbj xbp xbv xcb xch xcn xct xcz xdf xdl xdr xdx xed xej xep xev xfb xfh xfn xft xfz xgf xgl xgr xgx
xaa xag xam xas xay xbe xbk xbq xbw xcc xci xco xcu xda xdg xdm xds xdy xee xek xeq xew xfc xfi xfo xfu xga xgg xgm xgs xgy
xab xah xan xat xaz xbf xbl xbr xbx xcd xcj xcp xcv xdb xdh xdn xdt xdz xef xel xer xex xfd xfj xfp xfv xgb xgh xgn xgt xgz
xac xai xao xau xba xbg xbm xbs xby xce xck xcq xcw xdc xdi xdo xdu xea xeg xem xes xey xfe xfk xfq xfw xgc xgi xgo xgu xha
/t/files ❯❯❯ shasum x* | head -3
294faa783957ea41ff3641f1b52fa85cf4bec89a xaa
e96b680d153fbfc9cde2239f80f956d3d8f3c183 xab
f60bbfb3eb95d95faea7586887a22765179ea8d0 xac
/t/files ❯❯❯ shasum x* | tail -3
86c24415d0d187c593cf25518e12b0ff0e878e82 xgz
4a2f5281f9d6f4d682b59d6f5b685b4580116d3d xha
c86b8537f69e8f48ce338ee264ba56927f4b9f79 xhb
In a nutshell we will:
- create a buffer the size of a block
- fill that buffer by reading raw bytes from the given files 1. if we reach EOF before filling up a buffer and there are more files to be read, continue
- once the buffer is full, calculate the SHA1 digest and append it to
pieces
There are probably more efficient ways to do it but this works a treat:
With the meat of it done, we can simply create the overall structure and call into the above. We just need to ensure that if multiple files are passed in, we set the info.files
accordingly. Note that the order in which the files are listed mirror how pieces are calculated (so if you changed the order in the torrent, you’d expect pieces
to change):
Let’s see what we get:
/V/r/g/src ❯❯❯ go run ./main.go create -announce http://localhost:8088 -name foo -pieceLength 65536 -out /tmp/files.torrent /tmp/files/file*
294faa783957ea41ff3641f1b52fa85cf4bec89a
e96b680d153fbfc9cde2239f80f956d3d8f3c183
f60bbfb3eb95d95faea7586887a22765179ea8d0
...
86c24415d0d187c593cf25518e12b0ff0e878e82
4a2f5281f9d6f4d682b59d6f5b685b4580116d3d
c86b8537f69e8f48ce338ee264ba56927f4b9f79
Wohoo it’s a match!
Now whilst the above makes me feel quite confident about the implementation, let’s do a final check by validating against a known implementation. Here’s a torrent I created using the qBittorrent client, containing those same 3 files:
/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/tmp/files.torrent | jq -r '.info.pieces' | shasum
b6d75c38ab9d2c526cba79570b4a7edadbf3767a -
/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/tmp/qbittorrent.torrent | jq -r '.info.pieces' | shasum
b6d75c38ab9d2c526cba79570b4a7edadbf3767a -
It isn’t super scientific but it does show a match :)
Taking it further
One thing we’ll need later is the ability to read arbitrary blocks - e.g. if a client requires block 43, we have to compute the offset and grab the corresponding bytes, which could straddle files.
Another useful addition would be to validate the pieces
section - quite a few clients will do this at start-up, to mark the downloaded blocks as good or bad. It’s also useful to do once a block has been downloaded to ensure (1) we received the data correctly and (2) we weren’t poisoned by a malicious peer.