Michal Kohutek

Mapping files into RAM with lazy loading (in C++)

Storing data in a non-human-readable binary format can be useful, because it takes much less space and parsing is easier and faster. It feels much more practical to access such a file like a vector rather than a file stream.

The stream functionality of regular file access in C++ is useful for parsing files, but when the files contain binary data, its only advantage over the old C style with getchar() is RAII. Storing data in a non-human-readable binary format can be useful, because it takes much less space and parsing is easier and faster. It feels much more practical to access such a file like a vector. I decided to make a small library to facilitate this.

For reasons that are part of NDA, it was designed with these additional non-functional requirements:

  1. The bytes by the start of the file are more likely to be needed than the ones by the end and the ones by the end are needed only if those by the start are needed
  2. In most cases if the file is written to, it will be written to frequently and by the end, otherwise not at all
  3. The files are not large enough to fill the RAM, so optimisation there isn’t worth the compromises
  4. Use a common interface to allow storing the data in compressed formats or some other ways

The basic idea

A RAII-compliant way to do this would be to read the file and fill a vector with its contents, give access to the vector’s contents and have the destructor copy it back from the vector into the file, truncating it before the start. Because the destructor cannot be trivial (it has to store the file name in addition to the flushing operation), there’s no way to do this by inheriting from std::vector.

The additional requirements make it a bit more complicated. Reading the whole file might not be necessary. Writing it might not be necessary.

Requirement 1 enables us to quicken the usage by reading only the bytes that are needed and only from the start. Having the caller announce how many bytes will be needed can be annoying. So it reads the file until the requested byte and some bytes ahead, which can be some percentage of the number of the requested byte with a lower bound (reading the first few bytes would take long) and an upper bound (to preserve RAM space). All bytes read can be stored in a vector and returned if requested without reading, because it’s assumed that skipping the start and reading a few bytes by the end of the file will not be needed.

Requirement 2 demands an optimisation for appending to the file. If the existing data are not modified, it should only append when flushed. This can be dealt with by keeping the last index of data that are already saved and keeping track if some modifications were done to the data in the file. The operator[] allows both writable and non-writable access, but it’s possible to distinguish between readable access and writable access using const-correctness. This can lead to a curious need to use const_cast to turn a non-const object into a const object.

Because of requirement 3, there is no need to remove bytes that were already accessed and presumably processed from RAM or other tricks to reduce memory usage.

Requirement 4 requires splitting the object into an interface and an object specifying the I/O. Making use of the lack of restrictions in implementation inheritance in C++, the interface class can contain the accessing logic. This allows us to avoid non-inlinable virtual functions when accessing the individual bytes and use virtual methods only for disk access that is much slower anyway.

The implementation

I first created a parent class named MemoryMappedFileBase that keeps track of the file name, the data and the modification state. It offers access to the data through operator[], checking if specific bytes can be read, interfaces to file access through load() and flush() functions and an interface with default implementations for appending.

Then I acreated a child class named MemoryMappedFileUncompressed for regular saving of the binary data to disk. It implements the load() and flush() functions and replaces the default-implemented appending functions so that they could be flushed by appending to file only (which might not be possible if the files were compressed).

The result is a class allowing a much more convenient way of accessing binary files:

MemoryMappedFileUncompressed file("saved_stuff");
file.push_back('a');
file.push_back('h');
file.push_back('o');
file.push_back('y');

for (int i = 0; file.canReadAt(i); i++) {
	std::cout << file[i];
}
std::cout << std::endl;

Objects instead of bytes

Reading the file byte by byte is annoying, because mostly the data will be numbers in integer or float binary formats, structs consisting of such numbers. In more contrived cases, there might also be fixed-width strings, flags or indexes pointing to other elements in the array (pointers cannot be stored this way and are usually needlessly large anyway).

This is implemented through a templated class named MemoryMappedFile that accesses the data stored in some child class of MemoryMappedFileBase contained within it without copying it byte by byte (using reinterpret_cast on pointers and recalculated boundary checks).

This allows storing objects with ease:

struct Entry {
	uint32_t timestamp;
	int32_t value;
	Entry(uint32_t timestamp, int32_t value) :
			timestamp(timestamp), value(value) {}
}
MemoryMappedFile<Entry, MemoryMappedFileUncompressed> file("stuff");
file.push_back(Entry(time(nullptr), 13));

std::cout << file[0].timestamp << " " << file[0].value << std::endl;
std::cout << "Size is now: " << file.size() << std::endl;

This can lead to problems with memory alignment. In the example, the struct is 8 bytes long and identical on 32-bit and 64-bit architectures, but if it needed 12 bytes, it might be of that size of 12 bytes on 32-bit architectures and 16 bytes on 64-bit architectures. It’s up to the user to align it properly if needed.

Compression

The size of the data can be reduced using compression. The repetitive pattern of the contained objects allows good compression ratios, even over 75%. As you have probably guessed from the name MemoryMappedFileUncompressed, this is done through another class named MemoryMappedFileCompressed that allows very similar byte access, but from data stored in an archive.

A very good compression algorithm is LZMA used in 7z and xz archive formats. Without giving it much thought, I have picked LZMA SDK to provide it, simply because it’s known for being the most efficient. Its interface is about as friendly as a Schutzstaffel deathsquad, so getting a few facade functions to it working was masochistic. Heck it even required supplying it a pointer to a function while assuming that 8 bytes before that pointer is a pointer the function will receive as its argument so that it could be passed information what file it has to work with. When I finally got it running, it turned out it relied on Windows API (I was on Windows because companies often get stuck with unwise choices done in the distant past). It could be redone using XZ Utils or by refactoring it (which it badly needs) and using std libraries for the Windows API functionality, but I am not going to do it soon.

It can be used by using MemoryMappedFileCompressed everywhere instead of MemoryMappedFileUncompressed in the example.

Conclusion

For reasons I cannot disclose, I have created a library for storing objects in files that optimise frequent appends and reading objects by the starts of files that allows both compressed and uncompressed access with an interface similar to vector. The source code is on my mu github page. Its headers are commented in high detail for more information.

Leave a Reply

Your email address will not be published. Required fields are marked *