In part 1 we looked at parsing bencoded data. Let’s put this to good use and start looking at .torrent files - we’ll finish by creating our own (from scratch!).

We’re building up from the go-bt repository. Clone it if you want to follow along.

The .torrent file

At a basic level the .torrent file contains 2 types of information: enough for the client to connect to a tracker and query for peers (in the form of an announce key), and details about the content itself (primarily its length, the piece size and the piece hashes).

Note: not all torrents rely on trackers. Trackerless torrents use something called DHT and do not have an announce key in the torrent dictionary. We may revisit this at some point, but not right now.

Let’s parse a sample torrent file for Ubuntu:

/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=./bencode/testdata/ubuntu.torrent 2>&1 | cut -c 1-80
{
  "announce": "https://torrent.ubuntu.com/announce",
  "announce-list": [
    [
      "https://torrent.ubuntu.com/announce"
    ],
    [
      "https://ipv6.torrent.ubuntu.com/announce"
    ]
  ],
  "comment": "Ubuntu CD releases.ubuntu.com",
  "created by": "mktorrent 1.1",
  "creation date": 1677174459,
  "info": {
    "length": 1975971840,
    "name": "ubuntu-22.04.2-live-server-amd64.iso",
    "piece length": 262144,
    "pieces": "*8\ufffd7R\ufffd\ufffdEُ\ufffdp܊\ufffd\ufffd3\ufffd\u00087MV\ufff
  }
}

(output is truncated to 80 characters - the pieces key contains the concatenated hash of each piece, which gets, well, big).

The announce key tells us where the tracker lives (cf. the next section) - we’ll primarily connect to the tracker to source peers. The other fields are self-explanatory and are mostly there for the benefit of the user.

The info map inside the top-level one contains details about the actual contents of the torrent - all the juicy bits are here. In the example above we have:

  • length: the number of bytes in the file
  • piece length: number of bytes in each piece
  • pieces: the SHA1 digest of each piece, concatenated together
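
If it helps to picture the shape of the data, here’s a rough sketch of the single-file metainfo as Go types. This is purely illustrative - the field names are mine, and as we’ll see below the go-bt code works with plain maps instead:

// Illustrative only - go-bt uses map[string]any rather than
// dedicated types for the metainfo.
type MetaInfo struct {
	Announce     string     // tracker URL
	AnnounceList [][]string // optional tiers of fallback trackers
	Info         Info
}

type Info struct {
	Name        string // file name (or directory name for multi-file torrents)
	Length      int64  // total size in bytes (single-file variant)
	PieceLength int64  // bytes per piece
	Pieces      []byte // concatenated 20-byte SHA1 digests, one per piece
}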

To find the total number of pieces we divide length by piece length and round up to the next integer:

/V/r/g/src ❯❯❯ python3 -c 'import sys, math;print(math.ceil(1975971840/262144))' 
7538
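
If we wanted the same calculation in Go, integer arithmetic avoids any floating-point concerns - a minimal sketch (this helper isn’t part of go-bt):

// numPieces returns ceil(length/pieceLength) using integer
// arithmetic only.
func numPieces(length, pieceLength int64) int64 {
	return (length + pieceLength - 1) / pieceLength
}

numPieces(1975971840, 262144) gives the same 7538.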

pieces contains the 20-byte SHA1 digest of each piece, concatenated together - which should be 7538*20=150760 bytes. Sure enough, that’s exactly the length of the pieces string in the torrent file (right at the end):

/V/r/g/src ❯❯❯ cat ./bencode/testdata/ubuntu.torrent | fold -w 80
d8:announce35:https://torrent.ubuntu.com/announce13:announce-listll35:https://to
rrent.ubuntu.com/announceel40:https://ipv6.torrent.ubuntu.com/announceee7:commen
t29:Ubuntu CD releases.ubuntu.com10:created by13:mktorrent 1.113:creation datei1
677174459e4:infod6:lengthi1975971840e4:name36:ubuntu-22.04.2-live-server-amd64.i
so12:piece lengthi262144e6:pieces150760:*8

(if you’re wondering why we’re not using jq for this, it’s because in the JSON output the binary string has been re-encoded, presumably into a UTF variant - which also explains why fold essentially truncates the output: there must be a null byte somewhere in that giant string)

Things are a little different for torrents containing multiple files! Let’s take the .torrent for Qubes OS:

/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/Users/axiomiety/Downloads/Qubes-R4.2.2-x86_64.torrent | cut -c 1-80
{
  "announce": "udp://tracker.torrent.eu.org:451",
  "announce-list": [
    [
      "udp://tracker.torrent.eu.org:451"
    ],
    [
      "udp://tracker.opentrackr.org:1337/announce"
    ],
    [
      "https://tracker.gbitt.info:443/announce"
    ],
    [
      "http://tracker.gbitt.info:80/announce"
    ]
  ],
  "created by": "mktorrent 1.1",
  "info": {
    "files": [
      {
        "length": 6897252352,
        "path": [
          "Qubes-R4.2.2-x86_64.iso"
        ]
      },
      {
        "length": 1251,
        "path": [
          "Qubes-R4.2.2-x86_64.iso.DIGESTS"
        ]
      },
      {
        "length": 833,
        "path": [
          "Qubes-R4.2.2-x86_64.iso.asc"
        ]
      }
    ],
    "name": "Qubes-R4.2.2-x86_64",
    "piece length": 1048576,
    "pieces": "ɱ\ufffd]aG\ufffd\ufffd\ufffd\ufffda\ufffd7\u003cru\ufffd.bK\ufff
  },
  "url-list": [
    "https://mirrors.kernel.org/qubes/iso/",
    "https://ftp.qubes-os.org/iso/"
  ]
}

The info map still contains the familiar piece length and pieces keys, with the addition of a new files list that contains both the path and length of each individual file.

A quick calculation shows we should have:

>>> math.ceil((6897252352+1251+833)/1048576)
6578

pieces, and we therefore expect the pieces string to be 6578*20=131560 bytes long:

/V/r/g/src ❯❯❯ cat ~/Downloads/Qubes-R4.2.2-x86_64.torrent | fold -w 80 
d8:announce32:udp://tracker.torrent.eu.org:45113:announce-listll32:udp://tracker
.torrent.eu.org:451el42:udp://tracker.opentrackr.org:1337/announceel39:https://t
racker.gbitt.info:443/announceel37:http://tracker.gbitt.info:80/announceee10:cre
ated by13:mktorrent 1.14:infod5:filesld6:lengthi6897252352e4:pathl23:Qubes-R4.2.
2-x86_64.isoeed6:lengthi1251e4:pathl31:Qubes-R4.2.2-x86_64.iso.DIGESTSeed6:lengt
hi833e4:pathl27:Qubes-R4.2.2-x86_64.iso.asceee4:name19:Qubes-R4.2.2-x86_6412:pie
ce lengthi1048576e6:pieces131560:ɱ
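
As an aside, the total length of a multi-file torrent isn’t stored anywhere - it has to be summed from the files list, as we did above. Programmatically that might look like this (a sketch - the type assertions assume a decoder that yields map[string]any, []any and int64, which may not match go-bt exactly):

// totalLength sums the individual file lengths in a multi-file info dict.
func totalLength(info map[string]any) int64 {
	var total int64
	for _, f := range info["files"].([]any) {
		total += f.(map[string]any)["length"].(int64)
	}
	return total
}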

An interesting caveat in all this is that a single piece can contain bytes belonging to more than one file! Take this example:

file1=120
file2=40
file3=30

Total size is 120+40+30=190. If the piece length were 100, the second piece would contain the last 20 bytes of file1, along with all the bytes of file2 and file3. It’s a tiny detail but it does emphasise that pieces span the whole byte stream, ignoring file boundaries (we’ll revisit this when requesting blocks from peers, especially with respect to block sizes).
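
To make this concrete, here’s a small self-contained sketch (not from the repo; the min/max builtins need Go 1.21+) that works out which file ranges make up that second piece:

package main

import "fmt"

func main() {
	files := []struct {
		name string
		size int64
	}{{"file1", 120}, {"file2", 40}, {"file3", 30}}
	const pieceLength = 100
	const p = 1                     // the second piece, 0-indexed
	start := int64(p * pieceLength) // absolute offset of the piece in the byte stream
	end := start + pieceLength
	var fileStart int64 // absolute offset where the current file begins
	for _, f := range files {
		fileEnd := fileStart + f.size
		if start < fileEnd && end > fileStart { // does the piece overlap this file?
			from := max(start-fileStart, 0)
			to := min(end, fileEnd) - fileStart
			fmt.Printf("%s: bytes %d to %d\n", f.name, from, to-1)
		}
		fileStart = fileEnd
	}
}

This prints file1: bytes 100 to 119, file2: bytes 0 to 39 and file3: bytes 0 to 29 - matching the hand calculation (the second piece is only 90 bytes, since 190 isn’t a multiple of 100).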

Creating our own .torrent

To make sure we understand how this works, here’s a short snippet that creates 3 files with pseudo-random data (I appreciate it isn’t in Go, but heh):

import random
random.seed(0xdeadbeef)
files = [("file1", 7), ("file2", 2), ("file3", 3)]
for filename, filesize_in_mb in files:
    with open(filename, "wb") as fh:
        fh.write(bytearray(random.getrandbits(8) for _ in range(filesize_in_mb*1_000*1_000)))

This should allow you to re-create the same files on your machine - here are the SHA1 digests:

/t/files ❯❯❯ python3 gen-random-files.py
/t/files ❯❯❯ ls -lh file*
-rw-r--r--@ 1 axiomiety  wheel   6.7M Aug 30 15:25 file1
-rw-r--r--@ 1 axiomiety  wheel   1.9M Aug 30 15:25 file2
-rw-r--r--@ 1 axiomiety  wheel   2.9M Aug 30 15:25 file3
/t/files ❯❯❯ shasum file*
758d2401caa0d71d71cffd84d8491c6b07a5cb5f  file1
2035dbcd7c76b22f3112426ceebffe75117af26d  file2
6149596f744de4098ec1d43dc1999cc4c32a40a0  file3

A quick back-of-the-envelope calculation shows that for the files above, using a piece length of 64 KiB, we should have a total of 184 pieces - so 184 digests:

>>> 12*1_000_000/2**16
183.10546875

We can use split to simulate each piece, just to ensure we’re computing the hashes correctly - we’re particularly interested in how file and piece boundaries are handled. For argument’s sake let’s print the digests of the first and last 3 pieces:

/t/files ❯❯❯ cat file* | split -b 65536
/t/files ❯❯❯ ls -1 x* | wc -l
     184
/t/files ❯❯❯ ls
file1 xad   xaj   xap   xav   xbb   xbh   xbn   xbt   xbz   xcf   xcl   xcr   xcx   xdd   xdj   xdp   xdv   xeb   xeh   xen   xet   xez   xff   xfl   xfr   xfx   xgd   xgj   xgp   xgv   xhb
file2 xae   xak   xaq   xaw   xbc   xbi   xbo   xbu   xca   xcg   xcm   xcs   xcy   xde   xdk   xdq   xdw   xec   xei   xeo   xeu   xfa   xfg   xfm   xfs   xfy   xge   xgk   xgq   xgw
file3 xaf   xal   xar   xax   xbd   xbj   xbp   xbv   xcb   xch   xcn   xct   xcz   xdf   xdl   xdr   xdx   xed   xej   xep   xev   xfb   xfh   xfn   xft   xfz   xgf   xgl   xgr   xgx
xaa   xag   xam   xas   xay   xbe   xbk   xbq   xbw   xcc   xci   xco   xcu   xda   xdg   xdm   xds   xdy   xee   xek   xeq   xew   xfc   xfi   xfo   xfu   xga   xgg   xgm   xgs   xgy
xab   xah   xan   xat   xaz   xbf   xbl   xbr   xbx   xcd   xcj   xcp   xcv   xdb   xdh   xdn   xdt   xdz   xef   xel   xer   xex   xfd   xfj   xfp   xfv   xgb   xgh   xgn   xgt   xgz
xac   xai   xao   xau   xba   xbg   xbm   xbs   xby   xce   xck   xcq   xcw   xdc   xdi   xdo   xdu   xea   xeg   xem   xes   xey   xfe   xfk   xfq   xfw   xgc   xgi   xgo   xgu   xha
/t/files ❯❯❯ shasum x* | head -3
294faa783957ea41ff3641f1b52fa85cf4bec89a  xaa
e96b680d153fbfc9cde2239f80f956d3d8f3c183  xab
f60bbfb3eb95d95faea7586887a22765179ea8d0  xac
/t/files ❯❯❯ shasum x* | tail -3
86c24415d0d187c593cf25518e12b0ff0e878e82  xgz
4a2f5281f9d6f4d682b59d6f5b685b4580116d3d  xha
c86b8537f69e8f48ce338ee264ba56927f4b9f79  xhb

In a nutshell we will:

  1. create a buffer the size of a piece
  2. fill that buffer by reading raw bytes from the given files
     • if we reach EOF before filling up the buffer and there are more files to be read, continue with the next file
  3. once the buffer is full, calculate the SHA1 digest and append it to pieces

There are probably more efficient ways to do it but this works a treat:

func calculatePieces(pieceLength int, filenames []string) string {
	/*
		pieces are calculated based on the continuous byte stream
		of the provided files
	*/
	var pieces bytes.Buffer
	pieceBuffer := bytes.NewBuffer(make([]byte, 0, pieceLength))
	h := sha1.New()
	for _, filename := range filenames {
		f, err := os.Open(filename)
		common.Check(err)
		defer f.Close()
		for {
			bytesToRead := pieceBuffer.Available()
			readBuffer := make([]byte, bytesToRead)
			bytesRead, err := f.Read(readBuffer)
			pieceBuffer.Write(readBuffer[:bytesRead])
			// time to process the next file, if any
			if err == io.EOF {
				break
			}
			// any other read error is fatal
			common.Check(err)
			// we filled a full piece - let's hash it
			// and append the digest
			if pieceBuffer.Available() == 0 {
				h.Write(pieceBuffer.Bytes())
				pieces.Write(h.Sum(nil))
				fmt.Println(hex.EncodeToString(h.Sum(nil)))
				pieceBuffer.Reset()
				h.Reset()
			}
		}
	}
	if pieceBuffer.Len() > 0 {
		// we have bytes left in the buffer!
		h.Write(pieceBuffer.Bytes())
		pieces.Write(h.Sum(nil))
		fmt.Println(hex.EncodeToString(h.Sum(nil)))
	}
	return pieces.String()
}

With the meat of it done, we can simply create the overall structure and call into the above. We just need to ensure that if multiple files are passed in, we set info.files accordingly. Note that the order in which the files are listed mirrors how pieces are calculated (so if you changed the order in the torrent, you’d expect pieces to change):

func CreateTorrent(outputFile string, announce string, name string, pieceLength int, filenames ...string) {
	// build the info dict
	infoDict := mkInfoDict(name, filenames, pieceLength)

	torrentMap := map[string]any{
		"announce":   announce,
		"created by": "go-bt",
		"info":       infoDict,
	}
	var buf bytes.Buffer
	bencode.Encode(&buf, torrentMap)
	err := os.WriteFile(outputFile, buf.Bytes(), 0644)
	common.Check(err)
}

func mkInfoDict(name string, filenames []string, pieceLength int) map[string]any {
	infoDict := map[string]any{"name": name}

	getNumBytes := func(filename string) int64 {
		stat, err := os.Stat(filename)
		common.Check(err)
		return stat.Size()
	}

	if len(filenames) == 1 {
		infoDict["length"] = getNumBytes(filenames[0])
	} else {
		files := make([]map[string]any, len(filenames))
		for idx, filename := range filenames {
			files[idx] = map[string]any{
				"path":   []string{filename},
				"length": getNumBytes(filename),
			}
		}
		infoDict["files"] = files
	}
	infoDict["pieces"] = calculatePieces(pieceLength, filenames)
	infoDict["piece length"] = pieceLength
	return infoDict
}

Let’s see what we get:

/V/r/g/src ❯❯❯ go run ./main.go create -announce http://localhost:8088 -name foo -pieceLength 65536 -out /tmp/files.torrent /tmp/files/file*
294faa783957ea41ff3641f1b52fa85cf4bec89a
e96b680d153fbfc9cde2239f80f956d3d8f3c183
f60bbfb3eb95d95faea7586887a22765179ea8d0
...
86c24415d0d187c593cf25518e12b0ff0e878e82
4a2f5281f9d6f4d682b59d6f5b685b4580116d3d
c86b8537f69e8f48ce338ee264ba56927f4b9f79

Woohoo, it’s a match!

Now whilst the above makes me feel quite confident, let’s do a final check by validating against a known implementation. Here’s a torrent I created using the qBittorrent client, containing those same 3 files:

/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/tmp/files.torrent | jq -r '.info.pieces'  | shasum
b6d75c38ab9d2c526cba79570b4a7edadbf3767a  -
/V/r/g/src ❯❯❯ go run ./main.go bencode -decode=/tmp/qbittorrent.torrent | jq -r '.info.pieces'  | shasum
b6d75c38ab9d2c526cba79570b4a7edadbf3767a  -

It isn’t super scientific but it does show a match :)

Taking it further

One thing we’ll need later is the ability to read arbitrary ranges of the byte stream - e.g. if a peer requests block 43, we have to compute the offset and grab the corresponding bytes, which could straddle files.
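
Here’s a hedged sketch of what that could look like - not code from the repo, just one way to read an arbitrary range out of the concatenated byte stream:

// readRange reads length bytes starting at offset in the continuous
// byte stream formed by concatenating the given files, in order.
func readRange(offset, length int64, filenames []string) ([]byte, error) {
	out := make([]byte, 0, length)
	var fileStart int64 // absolute offset where the current file begins
	for _, name := range filenames {
		stat, err := os.Stat(name)
		if err != nil {
			return nil, err
		}
		fileEnd := fileStart + stat.Size()
		// does this file contribute to the requested range?
		if fileEnd > offset && int64(len(out)) < length {
			f, err := os.Open(name)
			if err != nil {
				return nil, err
			}
			// seek to where the range starts (or continues) in this file
			if _, err := f.Seek(max(offset-fileStart, 0), io.SeekStart); err != nil {
				f.Close()
				return nil, err
			}
			buf := make([]byte, length-int64(len(out)))
			n, err := io.ReadFull(f, buf)
			f.Close()
			if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
				return nil, err
			}
			out = append(out, buf[:n]...)
		}
		fileStart = fileEnd
	}
	return out, nil
}

Piece i is then just readRange(int64(i)*pieceLength, pieceLength, filenames), with the final piece coming back short.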

Another useful addition would be to validate the pieces section - quite a few clients do this at start-up to mark downloaded pieces as good or bad. It’s also worth doing once a piece has been downloaded, to ensure (1) we received the data correctly and (2) we weren’t poisoned by a malicious peer.
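
Using the readRange sketch from above, validation could be as simple as re-hashing every piece and comparing - again illustrative rather than definitive:

// verifyPieces re-hashes each piece of the given files and compares
// it against the digests stored in the torrent's pieces string.
func verifyPieces(pieces string, pieceLength int64, filenames []string) []bool {
	numPieces := len(pieces) / sha1.Size // one 20-byte digest per piece
	good := make([]bool, numPieces)
	for i := 0; i < numPieces; i++ {
		data, err := readRange(int64(i)*pieceLength, pieceLength, filenames)
		if err != nil {
			continue // treat unreadable pieces as bad
		}
		digest := sha1.Sum(data)
		good[i] = string(digest[:]) == pieces[i*sha1.Size:(i+1)*sha1.Size]
	}
	return good
}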