First post in what will (hopefully?) become a series on building an extremely simple BitTorrent client in Go. At the very least we’ll aim to parse .torrent files, connect to peers and download files. Just to manage expectations, incoming requests (uploads), DHT and the like will likely be left as an exercise to the reader.
Note: most (if not all) of the snippets should be using any instead of interface{} - the two are now interchangeable
If you have ever downloaded a .torrent file from, say, Ubuntu and tried to open it in a text editor, you’ll be hard pressed to call it readable. The file is encoded in a format called bencode, the specification of which is available here. Here are the first few lines for reference:
~/r/c/go-bt ❯❯❯ head src/ubuntu.torrent | fold -w 80
d8:announce35:https://torrent.ubuntu.com/announce13:announce-listll35:https://to
rrent.ubuntu.com/announceel40:https://ipv6.torrent.ubuntu.com/announceee7:commen
t29:Ubuntu CD releases.ubuntu.com10:created by13:mktorrent 1.113:creation datei1
681992889e4:infod6:lengthi2641264640e4:name34:ubuntu-23.04-live-server-amd64.iso
12:piece lengthi262144e6:pieces201520:%
The TL;DR is that we can use it to represent basic types - e.g. a positive integer is represented as a stream starting with i and ending with e. So 123 would be encoded as i123e. Lists have a similar setup - with l being the start delimiter. So the list [123,456] would be represented as li123ei456ee. Lists can contain other lists too!
Byte strings (I guess that’d be char in C?) are a little different in that they start with a number representing the length, delimited by : - so “foo” would be encoded as 3:foo. Don’t confuse this with UTF-8 type strings you have in Python - len("😜".encode("utf-8")) is actually 4 bytes!
Maps are similar to lists, starting with a d (easy heh?). Keys must be sorted in byte order and must be byte strings - so the Python dict {'foo': 123} would be encoded as d3:fooi123ee.
Approach
If you have ever used a stream-based parser like StAX for XML, or PyYAML, we’ll be doing something somewhat similar. Take this list with a nested element for example:
[1, [2], 3] = li1eli2eei3ee
As we process the stream, the structure looks a little bit like a tree:
[
  1,
  [
    2
  ],
  3
]
When we encounter the first [ we set the “node type” to a list, and recursively call our parser with the rest of the stream. After reading 1 we look to see if we have a root node - we do, and it’s a list, so we append the value to it. We keep recursively calling ourselves with the remaining bytes of the stream. In pseudo code, this looks a bit like:
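A simplified sketch of the idea - this only handles integers and lists and collects values into plain slices (the real thing will use the Holder interface introduced below):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseList assumes the leading 'l' has already been consumed and keeps reading
// elements until it hits the matching 'e', recursing when it finds a nested list.
func parseList(r *bufio.Reader) ([]any, error) {
	var out []any
	for {
		b, err := r.ReadByte()
		if err != nil {
			return nil, err
		}
		switch b {
		case 'e': // end of this list - unwind one level
			return out, nil
		case 'l': // nested list - recurse with the rest of the stream
			child, err := parseList(r)
			if err != nil {
				return nil, err
			}
			out = append(out, child)
		case 'i': // integer - read everything up to the closing 'e'
			s, err := r.ReadString('e')
			if err != nil {
				return nil, err
			}
			n, err := strconv.Atoi(strings.TrimSuffix(s, "e"))
			if err != nil {
				return nil, err
			}
			out = append(out, n)
		default:
			return nil, fmt.Errorf("unexpected byte %q", b)
		}
	}
}

func main() {
	r := bufio.NewReader(strings.NewReader("li1eli2eei3ee"))
	r.ReadByte() // consume the outer 'l'
	v, _ := parseList(r)
	fmt.Println(v) // [1 [2] 3]
}
```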
Here’s the general interface - we only need two methods: one to add values as we see them, and one to return the contents.
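A minimal sketch of what that looks like (the exact method names may differ from what’s in the repo):

```go
// Holder accumulates parsed values - one implementation per container type.
type Holder interface {
	// Add records a value as it comes off the stream.
	Add(v any)
	// Value returns whatever has been accumulated so far.
	Value() any
}
```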
Here are the basic types (List, Dict and Value - the latter is for scalars of all kinds):
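Roughly (field names here are illustrative):

```go
// ListHolder accumulates list elements in order.
type ListHolder struct {
	Values []any
}

// DictHolder accumulates key/value pairs; Key remembers the pending key
// while we wait for its value.
type DictHolder struct {
	Values map[string]any
	Key    string
}

// ValueHolder wraps a single scalar (integer or byte string).
type ValueHolder struct {
	V any
}
```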
Their implementation of the Holder interface is as follows:
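Something along these lines - a sketch rather than the original code (note this version uses an empty Key as the “unset” marker, which is a simplification):

```go
func (l *ListHolder) Add(v any)  { l.Values = append(l.Values, v) }
func (l *ListHolder) Value() any { return l.Values }

func (d *DictHolder) Add(v any) {
	if d.Values == nil {
		d.Values = map[string]any{}
	}
	if d.Key == "" {
		// no pending key - this value is the start of a new key
		d.Key = v.(string)
		return
	}
	// a key is pending - this value completes the pair
	d.Values[d.Key] = v
	d.Key = ""
}
func (d *DictHolder) Value() any { return d.Values }

func (v *ValueHolder) Add(x any)  { v.V = x }
func (v *ValueHolder) Value() any { return v.V }
```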
The only one worth a mention is the DictHolder. When we finish processing an element inside the dict, we need to know whether it was a key or a value - so we use c.Key internally as a toggle. If it’s set, the next call to Add is for a value - and if not, it’s the start of a key.
Parsing
Now we’re almost ready to start parsing the stream! Let’s start with some basic tests (TDD FTW!) to ensure we’re doing the right thing. It’s not full coverage but should be enough to help us flush out silly mistakes.
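Something in this spirit - I’m assuming here that ParseBencoded2 takes an io.Reader and returns plain any values (ints, strings, []any and map[string]any):

```go
import (
	"reflect"
	"strings"
	"testing"
)

func TestParseBencoded(t *testing.T) {
	cases := map[string]any{
		"i123e":        123,
		"3:foo":        "foo",
		"li1ei2ee":     []any{1, 2},
		"d3:fooi123ee": map[string]any{"foo": 123},
	}
	for in, want := range cases {
		got := ParseBencoded2(strings.NewReader(in))
		if !reflect.DeepEqual(got, want) {
			t.Errorf("ParseBencoded2(%q) = %v, want %v", in, got, want)
		}
	}
}
```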
The parsing is split into two - we have the public interface ParseBencoded2 (2 because my first impl wasn’t something to be proud of), and the internal recursive parsing function called parseBencodeStream. The check function is a little helper to catch issues early. I like a good debugging session as much as the next person, but that’s simpler than letting errors propagate.
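The check helper is presumably nothing more exciting than the classic panic-on-error pattern:

```go
// check panics on any error - fine for a toy parser where we want to fail
// fast rather than thread errors all the way up.
func check(err error) {
	if err != nil {
		panic(err)
	}
}
```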
We kick things off with a ValueHolder (because we expect to return a single value - i1ei2e wouldn’t be “valid” per se, it’d need to be in a container) and a pointer to bufio.Reader, which allows us to continuously read from the same stream. We read a byte at a time in some cases, and in others until we hit a specific delimiter (e.g. e for integers, : for strings etc…). If we find a new container delimiter we create the appropriate holder (for a list or a map) and recurse accordingly.
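A sketch of how the internals could look, building on the Holder types from earlier (the real parseBencodeStream differs, and error handling is simplified here):

```go
import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

func parseBencodeStream(r *bufio.Reader, h Holder) error {
	for {
		b, err := r.ReadByte()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		switch {
		case b == 'e': // end of the current container - hand control back to the caller
			return nil
		case b == 'i': // integer - everything up to the next 'e'
			s, err := r.ReadString('e')
			if err != nil {
				return err
			}
			n, err := strconv.Atoi(strings.TrimSuffix(s, "e"))
			if err != nil {
				return err
			}
			h.Add(n)
		case b == 'l': // new list - recurse with a fresh holder
			child := &ListHolder{}
			if err := parseBencodeStream(r, child); err != nil {
				return err
			}
			h.Add(child.Value())
		case b == 'd': // new dict - same idea
			child := &DictHolder{}
			if err := parseBencodeStream(r, child); err != nil {
				return err
			}
			h.Add(child.Value())
		case b >= '0' && b <= '9': // byte string - "<length>:<bytes>"
			lenStr, err := r.ReadString(':')
			if err != nil {
				return err
			}
			length, err := strconv.Atoi(string(b) + strings.TrimSuffix(lenStr, ":"))
			if err != nil {
				return err
			}
			buf := make([]byte, length)
			if _, err := io.ReadFull(r, buf); err != nil {
				return err
			}
			h.Add(string(buf))
		default:
			return fmt.Errorf("unexpected byte %q", b)
		}
	}
}
```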
Running our tests shows we’re home free:
/V/r/c/g/src ❯❯❯ go test ./data
ok go-bt/data 0.351s
Encoding
We’re half-way there. We can parse data, let’s make sure we can serialise it back. The beauty of this is that from a test perspective, encode(parse(data)) is equivalent to the identity function! That is, this is just equal to data. A bit like decompressing a compressed file.
Caveat emptor: the encoding code will be sufficient for BitTorrent but isn’t comprehensive enough to support all kinds of types!
If you’ve ever used reflection in say, Java, you’ll appreciate how terse it is in golang! As with the parser, we’ll need to recurse on each value in a container. For maps that means we’ll need to first encode the key, then the value. For lists, we’ll need to encode each element.
As done previously, let’s start with some tests - we only need to specify the input and ensure that’s what we get back:
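A round-trip test in that spirit (the Encode name and signature are assumptions on my part):

```go
func TestEncodeRoundTrip(t *testing.T) {
	inputs := []string{
		"i123e",
		"3:foo",
		"li1eli2eei3ee",
		"d3:fooi123ee",
	}
	for _, in := range inputs {
		parsed := ParseBencoded2(strings.NewReader(in))
		if got := Encode(parsed); got != in {
			t.Errorf("encode(parse(%q)) = %q, want the input back unchanged", in, got)
		}
	}
}
```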
Now for abusing reflect with all kinds of types:
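One way to write it, switching on reflect.Kind - a sketch (the original handles more cases, and Encode is my name for it):

```go
import (
	"fmt"
	"reflect"
	"sort"
	"strings"
)

func Encode(v any) string {
	rv := reflect.ValueOf(v)
	switch rv.Kind() {
	case reflect.Int, reflect.Int64:
		return fmt.Sprintf("i%de", rv.Int())
	case reflect.String:
		s := rv.String()
		return fmt.Sprintf("%d:%s", len(s), s)
	case reflect.Slice:
		var b strings.Builder
		b.WriteByte('l')
		for i := 0; i < rv.Len(); i++ {
			b.WriteString(Encode(rv.Index(i).Interface()))
		}
		b.WriteByte('e')
		return b.String()
	case reflect.Map:
		// bencode requires dict keys in sorted byte order
		keys := make([]string, 0, rv.Len())
		for _, k := range rv.MapKeys() {
			keys = append(keys, k.String())
		}
		sort.Strings(keys)
		var b strings.Builder
		b.WriteByte('d')
		for _, k := range keys {
			b.WriteString(Encode(k))
			b.WriteString(Encode(rv.MapIndex(reflect.ValueOf(k)).Interface()))
		}
		b.WriteByte('e')
		return b.String()
	default:
		// anything we don't recognise (structs included) blows up loudly
		panic(fmt.Sprintf("unsupported kind %s", rv.Kind()))
	}
}
```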
Note the default in the switch - there may be types we’re not aware of, and those will cause us to panic. Particularly if we passed in a struct of some kind we’d be toast.
Struct tags and parsing a .torrent file!
Be warned, this abuses reflect even moar
Now that we can marshal data in and out, what we ideally want is to represent a .torrent file as a series of structs (it contains nested fields so it won’t be just one) and have it be populated automagically on the back of a bencoded stream. This is where tags come in handy - similarly to what one might do when marshalling to/from JSON.
Note that we only need to define tags for the fields we want populated automatically - fields that don’t contain tags will be left alone. Furthermore, we can nest data types - for example a .torrent file is, at the top level, nothing more than a map with some predefined keys - and some of those keys (like info) hold maps themselves.
A subset of the fields could be represented as such:
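Based on the dump at the top of the post, something like this (the exact field set in the repo may be larger):

```go
type BEInfo struct {
	Length      int    `bencode:"length"`
	Name        string `bencode:"name"`
	PieceLength int    `bencode:"piece length"`
	Pieces      string `bencode:"pieces"`
}

type BETorrent struct {
	Announce     string     `bencode:"announce"`
	AnnounceList [][]string `bencode:"announce-list"`
	Comment      string     `bencode:"comment"`
	CreatedBy    string     `bencode:"created by"`
	CreationDate int        `bencode:"creation date"`
	Info         BEInfo     `bencode:"info"`
}
```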
Filling in a BETorrent struct consists of iterating through each field and finding the appropriate match in the encoded data. Side note, announce-list is a list of lists of strings… :shrug:
fillStruct is the workhorse that parses data both iteratively and recursively:
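A simplified sketch of the shape of it (the real version handles more cases; notably, slice fields like announce-list need per-element handling that’s skipped here):

```go
// fillStruct copies values from a decoded bencode dictionary into the tagged
// fields of the struct pointed to by out, recursing into nested structs.
func fillStruct(data map[string]any, out any) {
	v := reflect.ValueOf(out).Elem() // out must be a pointer to a struct
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		tag := t.Field(i).Tag.Get("bencode")
		if tag == "" {
			continue // untagged fields are left alone
		}
		raw, ok := data[tag]
		if !ok {
			continue
		}
		field := v.Field(i)
		switch field.Kind() {
		case reflect.Struct:
			// a nested map (e.g. "info") - recurse into the sub-struct
			if m, ok := raw.(map[string]any); ok {
				fillStruct(m, field.Addr().Interface())
			}
		default:
			rv := reflect.ValueOf(raw)
			if rv.Type().ConvertibleTo(field.Type()) {
				field.Set(rv.Convert(field.Type()))
			}
		}
	}
}
```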
Note it isn’t fully generic. For one, it expects a map[string]any for the data that needs to be deserialised. Also, the handling of reflect.Slice feels a bit wonky even though it works.
But can we go back?
We can parse a .torrent into a struct, and we can encode an arbitrary dictionary. But can we go from a struct back into a bencoded map? It turns out we’ll need this later, so let’s get it out of the way now. In a nutshell, these are the tests we want to satisfy. Note that the return value should be a map[string]any - we already know how to encode this kind of data structure:
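Something like this - structToMap is a name I’m assuming here; the point is the shape of the expected output:

```go
func TestStructToMap(t *testing.T) {
	in := BETorrent{
		Announce: "https://torrent.ubuntu.com/announce",
		Comment:  "Ubuntu CD releases.ubuntu.com",
	}
	want := map[string]any{
		"announce": "https://torrent.ubuntu.com/announce",
		"comment":  "Ubuntu CD releases.ubuntu.com",
	}
	if got := structToMap(in); !reflect.DeepEqual(got, want) {
		t.Errorf("structToMap(%+v) = %v, want %v", in, got, want)
	}
}
```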
Leveraging the pattern that allowed us to go from dictionary to struct, we can reverse the process: go through the struct and look for fields with a bencode tag (ignoring those that don’t have one or, just as importantly, those that aren’t set):
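A sketch of that reverse direction (same caveat on the name; nested structs and slices would need the same recursive treatment as fillStruct):

```go
func structToMap(in any) map[string]any {
	out := map[string]any{}
	v := reflect.ValueOf(in)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		tag := t.Field(i).Tag.Get("bencode")
		if tag == "" {
			continue // no tag - not part of the bencoded output
		}
		field := v.Field(i)
		if field.IsZero() {
			continue // unset fields are skipped, as described above
		}
		out[tag] = field.Interface()
	}
	return out
}
```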
Taking it further
.torrent files aren’t the only files leveraging bencode in BitTorrent - the tracker response, for instance, is another. I had to make a few changes to leverage generics - e.g.
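I won’t guess the exact change, but a generic helper in this direction is the sort of thing that tidies up the call sites (purely illustrative):

```go
// Unmarshal decodes a bencode dictionary straight into a typed struct.
func Unmarshal[T any](data map[string]any) T {
	var out T
	fillStruct(data, &out)
	return out
}

// e.g. resp := Unmarshal[BETrackerResponse](decoded)
```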
where a subset of the relevant structs are defined as:
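Guessing at the shape from the tracker protocol itself (field names and tags here are assumptions, not the repo’s actual definitions):

```go
type BEPeer struct {
	IP   string `bencode:"ip"`
	Port int    `bencode:"port"`
}

type BETrackerResponse struct {
	Interval int      `bencode:"interval"`
	Peers    []BEPeer `bencode:"peers"`
}
```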
All in all this was rather fun - except the reflect part, which, whilst satisfying to get right, was more of a pain than I care to admit. But it does make the structs that much easier to use!