Subversion beats Perforce in handling large files, and it's not even close

Monday, April 8, 2024

Introduction

The author is working on a high-fidelity open-world co-op VR game, and has been using git for some time. Git is a free and open-source Version Control System.

To the author, the most important feature of such Version Control Systems is that any change can be undone. If the project is in a great state, the author can experiment with new systems risk-free; and if those large-scale changes don't work out, the project can be reverted to a good state. The author finds this functionality to be essential. More generally, the version control system tracks the history of changes, and allows the user to go back in time to any "commit" (snapshot, essentially) that has been made.

Note that git is not GitHub - GitHub simply provides services around git technology. Git repositories can be self-hosted, either directly or through a wrapper such as OneDev or Gitea. On this note, the following data focuses on self-hosting. The author believes that self-hosting, when done correctly, provides optimal data security, integrity, and control. However, if the user is not comfortable with this and does not understand the requirements, it is safer for them to pay someone else to host their Version Control System.

Git is fine until larger files are committed - such as textures and models, commonly found in games. When cloning a git repository, the entire history is cloned. If the repository contains ten versions of a texture, the user downloads all of the data for those historical versions. If the texture is stored uncompressed, git can diff it and the stored changes may be small - but if not, then a 100MB texture could be consuming 1GB in the repository if there are ten versions of it, and all of this is downloaded when cloning. This becomes a growing, though often not insurmountable, issue.

A bigger problem found by the author is that as the repository grew, git.exe on the server would consume more and more memory during clones. For the author's game, it was impossible to clone if the server had less than 20GB of memory. The author suspects that this is an issue with Git for Windows, but it could be an issue with Git itself. The repository size was 25GB.

The typically recommended solution is to use Git LFS, a third party add-on for git, intended to handle large files. When configured correctly, Git LFS will store designated file types outside of the repository, on an LFS server. This prevents the repository itself from growing too much, but separates the data. The major drawback to this is that LFS data is not stored as deltas. Instead, each revision of a file is stored in its entirety. This can cause the total data size to grow substantially, as git compression is quite effective and LFS bypasses it.
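
For context, a typical Git LFS setup designates the file types to store on the LFS server before they are committed. The following is a minimal sketch - the tracked extensions are examples, not a recommendation for any particular project:

    git lfs install
    git lfs track "*.png"
    git lfs track "*.fbx"
    git add .gitattributes
    git commit -m "Track large binary types with Git LFS"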

Additionally, there are many complaints of issues with LFS - and both times the author attempted to use it, they could not get it to function correctly. Each time the repository was cloned, different LFS files would fail to fetch, and the repository could never be cloned completely. Many LFS files would exist only as .part files, and could not be completed. If the repository was cloned again, other files would exist as .part files - one clone would contain a complete copy of a particular file, and another clone would not. Git LFS, in the author's experience, is unreliable.

Wouldn't it be great if there was a version control system which just worked? A system which could store everything, and it would effectively store deltas between files of all types? A system which facilitated a clone with as little as 20MB of memory usage on the server, and would not require the user to download every revision of every file that ever existed?

It exists, and it is called Subversion (commonly abbreviated "svn").

Alternatives to Git

For high-fidelity game development, the author's experience is that git becomes unsustainable. Even when git was used, the author was extremely careful when committing files, and would avoid committing any large assets until they were confident the assets were needed. This workflow became difficult, and the author's home server would inevitably have become unable to support cloning the repository.

When looking for alternatives, the issue of large files is commonly noted, and the most frequent recommendation, other than Git LFS, is to use a system called Helix Core, developed by Perforce. The system was formerly named Perforce Helix, and many continue to refer to it as "Perforce".

When the author searched for comparisons of Version Control Systems, they found many links from Perforce themselves. These links largely dismiss other version control systems, and tell the viewer to use Perforce's own system. Even when searching "git vs svn", the first result seen by the author is this page from Perforce: https://www.perforce.com/blog/vcs/git-vs-svn-what-difference

The author believes that Perforce should have no involvement in a git-versus-svn comparison - neither product is theirs - but Perforce won't miss an opportunity to tell the viewer that their own system is better.

A screenshot from Perforce's comparison of git and svn. They will not miss an opportunity to push the user to their own product. From https://www.perforce.com/blog/vcs/git-vs-svn-what-difference.

When the author searched "helix core vs svn", they again found results from Perforce - which is fine, the query now involves their product. What is not fine, is that Perforce continues to make bold claims, and they continue to provide no data at all to support those claims. Perforce now bases these claims on "conventional wisdom", which the author finds to be un-scientific.

A screenshot from https://www.perforce.com/resources/vcs/perforce-vs-svn, where Perforce unfairly claims that Subversion shows inferior performance, and that their system is better. No data is provided, only "conventional wisdom".

In this case, Perforce has written an article to compare Subversion against Helix Core, and they state that it is "hard to find concrete benchmark data". The author finds this to be unacceptable. Perforce should have completed their own tests - and if they are unwilling to do this, they should not have written the article.

The author sees that these links pollute the search results and provide little useful information, which is the author's main reason for writing this article. The image above shows bold claims, with zero data to support them. What Perforce has provided is simply marketing; and on that point, the user must hand over all of their contact information before they can access Perforce's downloads.

The author sees that Perforce has done well with their marketing, and many have believed it. From everything the author had read, they assumed that Subversion would be complex and lacking in features, and that Helix Core would simply be better - but this is not the case at all.

The main reasons people cite when recommending Perforce Helix Core are that it can handle large files, and that it is more scalable than the alternatives. Let us investigate this.

How are large files handled by Subversion and Helix Core?

On Perforce's website, they make the following claim:

Helix Core is better than SVN for binary file management. SVN can handle multiple smaller binary files fine. It handles single large files fine, but performance slows or fails when dealing with multiple large binary files. For teams working with many large art assets, like game development and virtual production teams, Subversion simply won’t cut it.

https://www.perforce.com/blog/vcs/version-control-for-binary-files

Perforce again provides no data - so the author will create some.

In these tests, Subversion is accessed over the svn+ssh protocol.
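
As an illustration, a checkout over svn+ssh looks like the following - the user name, host, and repository path are placeholders rather than the author's actual setup:

    svn checkout svn+ssh://user@homeserver/repos/game game-wc

The svn+ssh scheme tunnels the svnserve protocol over an ssh session, which is why both svnserve and sshd appear in the server memory figures below.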

The Subversion and Helix Core repositories in these tests were stored on the author's home server, on a Toshiba N300 8TB (mechanical), unless otherwise specified. The disk is generally idle. The server has 32GB RAM with an i5-4670k CPU, and runs Windows 10.

The client machine uses a 7950X3D with Windows 11. Client-side working copies were stored on a Corsair MP600 PRO LPX 2TB.

In these tests, explicit binary prefixes are used for data sizing where practical. For example, "1 MiB" is equal to 1024^2 bytes. All underlying data was counted to the byte, using logical (not physical) file sizes.

Test 1 - The 32 MiB Delta

Two files of 10 GiB and thirty files of 1 GiB were generated and filled with random data, for a total dataset of 50 GiB. The 1 GiB files could be representative of images. The 10 GiB files are included because Perforce has not defined "large", and 10 GiB is taken to be very large for a game asset.

These were then committed to a Subversion repository, and a Helix Core repository.

Random data was then written to the first megabyte of each file - totalling 32 MiB of changed data. These changes were then committed.
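
The following is a minimal sketch of how such a dataset and the follow-up change could be produced from a Unix-style shell (for example Git Bash or WSL); it is illustrative only, and the file names are placeholders rather than the author's actual method:

    # Two 10 GiB files and thirty 1 GiB files of random data.
    for i in 01 02; do
        head -c 10G /dev/urandom > "large_$i.bin"
    done
    for i in $(seq -w 1 30); do
        head -c 1G /dev/urandom > "asset_$i.bin"
    done

    # The follow-up change: overwrite the first 1 MiB of every file in place.
    # conv=notrunc stops dd from truncating the rest of the file.
    for f in large_*.bin asset_*.bin; do
        dd if=/dev/urandom of="$f" bs=1M count=1 conv=notrunc
    done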

Changes were committed using TortoiseSVN and P4V.

Results for Committing the Initial Files:

VCS        | Time        | Server Memory Usage (MB)    | New Repo Size (GiB)
Subversion | 8 min 24 s  | 2.8 + 6.6 (svnserve + sshd) | 50.009
Helix Core | 26 min 44 s | 21.2                        | 50.017

 

Results for Committing a 1 MiB change to each file (32 MiB total change):

VCS        | Time        | Server Memory Usage (MB) | New Repo Size (GiB)
Subversion | 10 min 44 s | 3.7                      | 50.050
Helix Core | 26 min 33 s | Not measured             | 100.032

 

Additional data:

  • SVN repo size before: 53,696,308,812 bytes. After: 53,740,282,077 bytes.
  • During the initial commit, SVN was limited by gigabit ethernet. For the second commit, it appeared to be single-core CPU limited on the client, processing one file at a time.
  • Helix Core repo size before: 53,705,323,922 bytes. After: 107,408,931,564 bytes.
  • During both submissions, Helix Core appeared to be single-core CPU limited on the server, which limited bandwidth to around 280Mbps. During SVN's initial commit, svnserve and sshd together consumed a similar amount of CPU to Helix, but enabled more than 3x the bandwidth with those same resources, and appeared to only be limited by the networking hardware. During the second SVN commit, there was not much CPU usage on the server as it was the client that calculated the deltas.
  • After committing, SVN on the client contained old pristine copies. The author ran TortoiseSVN's "Clean up", opting to vacuum pristine copies, which removed the old ones (a command-line equivalent is shown after this list).
  • Server-side SVN stores loose files during commits (inside db/transactions), but automatically packs them before completing the commit.
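
For those working from the command line rather than TortoiseSVN, the equivalent clean-up step can be run from the root of the working copy:

    svn cleanup --vacuum-pristines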

Test 2 - Many Files

In this test, files of 1 KiB each are created, each containing random data. These are then added and committed. Test 2 was repeated with 1,000 files, 10,000 files, and 50,000 files.

For this test, the server-side storage device was changed from an 8TB Toshiba N300, to a 1TB Samsung 850 EVO. This benefits both Subversion and Helix Core.

Changes were added and committed using the svn command-line and p4 command-line.
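
As a rough sketch, the dataset and the add/commit steps could look like the following from a Unix-style shell. The file count, names, and messages are placeholders, a configured Perforce client workspace is assumed, and this is not the author's exact command sequence:

    # Generate 10,000 files of 1 KiB of random data each.
    for i in $(seq -w 1 10000); do
        head -c 1K /dev/urandom > "small_$i.bin"
    done

    # Subversion: schedule all unversioned files, then commit.
    svn add --force .
    svn commit -m "Add 10,000 1 KiB files"

    # Helix Core: feed the file list via stdin ("-x -") to avoid an overlong
    # command line, then submit with a description.
    find . -maxdepth 1 -name "small_*.bin" | p4 -x - add
    p4 submit -d "Add 10,000 1 KiB files"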

Test 2.1: 1,000 Files

VCS        | Add Time (s) | Commit Time (s) | Server Memory Usage (MiB) | Repo Size Change (GiB) | Server File Count Increase
Subversion | 0.5          | 12              | Not measured              | 0.000                  | 2
Helix Core | 3            | 2.5             | Not measured              | 0.004                  | 1,000

 

Test 2.2: 10,000 Files

Before running each add and commit command, the Standby List (cached files in RAM) was cleared. This applies only to Test 2.2, not to 2.1 or 2.3.

Between each numbered trial, the committed files are removed from the repository, and the removal is committed. This leaves the local working copy empty, whilst the history is retained on the server.

Subversion

Trial # | Add Time (s) | Commit Time (s) | Server Memory Usage (MiB) | Repo Size Change (GiB) | Server File Count Increase
0       | 4            | 131             | 12.8                      | 0.015                  | 2
1       | 4            | 120             | 12.7                      | 0.015                  | 2
2       | 4            | 123             | Not measured              | 0.015                  | 2

Final Repository Size: 46 MiB

Helix Core

Trial # | Add Time (s) | Commit Time (s) | Server Memory Usage (MiB) | Repo Size Change (GiB) | Server File Count Increase
0       | 28           | 30              | 45.8                      | 0.049                  | 10,000
1       | 27           | 31              | 54.2                      | 0.041                  | 10,000
2       | 27           | 28              | 45.3                      | 0.044                  | 10,000

Final Repository Size: 210 MiB

Test 2.3: 50,000 Files

VCS        | Add Time (s) | Commit Time (s) | Server Memory Usage (MiB) | Repo Size Change (GiB) | Server File Count Increase
Subversion | 18           | 9,300           | 45.2                      | 0.041                  | 2
Helix Core | 140          | 155             | 252.5                     | 0.190                  | 50,000

 

Test 3 - Backup

If the work in the repository has any value, then it is of critical importance that the repository is regularly backed up.

The repositories are taken in their state at the end of Test 2.2, with equal histories. The Subversion repository has 46 files at 46 MiB; the Helix Core repository has all of the same data, with a file count of 30,120 at 210 MiB. These discrepancies are caused by fundamental differences in the two Version Control Systems.

To back up the repository, it is archived in RAR5 format using WinRAR 5.71 (the author was unaware that their home server was running this outdated version, and has since updated WinRAR). Default settings are used, with the following changes (a roughly equivalent command-line invocation is sketched after this list):

  • Compression Method: Store.
    • This tells WinRAR not to compress the data, as it is already compressed.
  • Add Recovery Record, 1%.
    • For backup resiliency.
  • Test archived files.
    • This tests the archive immediately following its creation.
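
As referenced above, a roughly equivalent command-line invocation would be the following, where -m0 selects "store" (no compression), -rr1p adds a 1% recovery record, and -t tests the archive after creation. The paths are placeholders, and the author used the WinRAR GUI rather than this command:

    rar a -m0 -rr1p -t D:\Backups\game-repo.rar D:\Repos\game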

The archive is created on the same disk as the repository. For this test, the repositories are first backed up on the 1TB Samsung 850 EVO. They are then moved to the 8TB Toshiba N300, and are backed up on that disk for comparison.

Before running each backup, the Standby List was cleared.

VCS        | 850 EVO Time (s) | N300 Time (s)
Subversion | 1                | 1
Helix Core | 78               | 95

 

Discussion

The tests have revealed three substantial issues.

  1. Helix Core did not diff the files. A 1 MiB change to a 10 GiB file means that Helix Core stores the entire 10 GiB file again, massively increasing storage requirements.
  2. Helix Core stores each file separately on disk, making it slow to back up, as each separate file incurs per-file read overhead. This is why git and Subversion store as few files as they can.
  3. Subversion slows to a crawl when committing tens of thousands of files.

Issues 1 and 2, with Helix Core, are not resolvable. The author believes that issue 3 with Subversion is caused by comparing each added file with every other added file, which causes commit time to grow quadratically (rather than linearly) as more and more files are added to the same commit, past some threshold. This issue is unfortunate, but is thankfully resolvable by committing chunks of files at a time, e.g. in batches of 10,000. The issue is most likely to appear at most once, for those migrating to Subversion; but if the user's workflow does produce tens of thousands of new files per commit, it may be worth creating a "batch commit" script which splits the commit into chunks, along the lines of the sketch below.
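
A minimal sketch of such a script, for a Unix-style shell, might look like the following. It assumes the new files have already been scheduled with "svn add" and are listed, one per line, in a hypothetical new-files.txt:

    # Split the list into chunks of 10,000 paths, then commit each chunk
    # separately, so that no single commit has to process the full file count.
    split -l 10000 new-files.txt batch_
    for f in batch_*; do
        svn commit --targets "$f" -m "Initial import ($f)"
    done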

Helix Core uses much more disk space than Subversion on the server. This is because it only diffs text files, and stores new files of non-text formats in full, not attempting to diff them with a prior version to save disk space. This causes a massive growth of the repository. The efficiency of Helix Core's text file diffing and storage has not been investigated.

Note that Test 2.2, for 10,000 files, is separated into trials for each Version Control System. First, note that Helix Core is faster at committing large numbers of small files. This commit speed is a trade-off, as it leaves the repository with tens of thousands of loose files and stores larger amounts of data. Although not included in the results, the Helix Core repository also grows when executing "p4 add" - the command contacts the server and writes data, which is why it is slow. Furthermore, when deleting files, the Helix Core repository grows substantially, whereas Subversion grows only by a small amount. Test 2.2 was run on clean repositories. All of these factors together mean that, at the end of this test, the Helix Core repository was 4.55x the size of the Subversion repository (210 MiB vs. 46 MiB), even though they contain the exact same history. Due to the number of loose files, simply checking the size of the Helix Core repository in Windows File Explorer took 8 seconds, whereas it was instantaneous for Subversion, which stores a very low file count.

As Helix Core stores each file separately, this creates a further issue. In Test 2 (Many Files), each test file was 1 KiB in size. Commonly, a disk will store data in 4 KiB clusters. This means that when a 1 KiB file is stored directly on the file system, it occupies 4 KiB of space on disk. Helix Core stores each of these small binary files individually, which results in each of those files consuming an extra 3 KiB of physical disk space. This has not been factored into the test results - the size reported in the tests is the logical size, not the physical size. For Test 2.2 with 10,000 files, although the logical size of the resultant Helix Core repository is 210 MiB, the physical size is 298 MiB; therefore the Helix Core repository occupies 6.49x the physical disk space of the Subversion repository.

As a reminder, Perforce claims that Subversion struggles with "multiple large binary files". Test 1 was intended to address this. In every category, Subversion outperformed Helix Core substantially. Subversion committed the initial files in one-third of the time it took Helix Core, and was still much faster in committing the changes. Subversion calculated deltas and sent those over the network. Even though Subversion completed its work much faster, it stored the changes as compact deltas, and its repository ended up at half the size of the Helix Core one. Each time these files are modified, Subversion can efficiently store deltas, but Helix Core will continue to cause massive increases in the size of the repository; opting to send the entire dataset over the network again, and store it in full again. Test 1 modified only 32 MiB of data, yet Helix Core sent the entire 50 GiB over the network, and increased its storage requirements by the full 50 GiB.

How, then, is Helix Core better than Subversion at handling large files?

Whilst Helix Core can handle large files, it does not do so intelligently. It shares this trait with Git LFS, which the author believes would also have doubled its storage for these small changes. Subversion, however, does handle these files intelligently. It stores deltas, and the repository size increased by only 41 MiB. These files can be modified as many times as the user requires, and Subversion will keep the storage requirements at a reasonable level. Note that 32 MiB of data was changed, yet Subversion stored 41 MiB of new data. Deltas require many CPU cycles to compute, and the author expects that Subversion has opted for a balance which works well - likely accepting a somewhat larger delta in exchange for faster processing.

In Test 1, Helix Core takes a longer time to produce a poor result. Imagine if the author repeated the test of changing 32 MiB of data, ten times; Helix Core would be storing 550 GiB of data, and Subversion would have tracked the exact same changes using approximately 50.5 GiB of data. Perforce boasts that they have clients with "petabytes" of data, and now the reader can understand why. If those clients switched to Subversion, they would likely be able to reduce their storage requirements.

Test 3 shows the backup results. It is extremely important to back up the data in the repository. This is even more important with centralised version control systems, such as Subversion and Helix Core: whereas git stores its entire history on the client, giving each client a backup of the repository, with Subversion and Helix Core the history exists in only one place - on the server. For this reason, the author's Subversion repositories are mirrored each hour to a second disk, and backed up each day. A mirror is not a backup, as it automatically inherits potential corruptions; and both mirrors and backups suffer from having to copy more data, and more files. As Helix Core repositories are much larger, and store so many loose files, the backup operations become very slow and intensive on the server; whereas with Subversion, they are much more performant, and remain performant across the lifetime of the repository. Subversion leaves the repository in a much more efficient state - the low file count and low data size enable the user to administer it efficiently. This is an extremely important difference between the two Version Control Systems. Data integrity is paramount, and sooner or later the server will experience some type of fault. With faster, smaller backups, the user can create backups more frequently, and store more of them.

One of the common negative remarks about Subversion is one it shares with git - the presence of the "pristine" client-side copy. This "pristine" copy exists in the client's working copy, as a duplicate of the working files. The existence of this pristine copy allows Subversion to calculate deltas for changed files and send those over the network, rather than sending the entire file. Other benefits are fast reverts to the pristine copy, and fast diffs against it. However, as a repository grows, the size of the pristine copy could become difficult to manage - it stores a duplicate of every single committed file in the current workspace. If many of these files do not change from commit to commit, as is the case for the author's game, then most of those pristines should not need to be stored. This is a negative point against Subversion - however, a new feature, "pristines-on-demand", is scheduled for Subversion 1.15 in 2024. This will allow the client-side pristines to be removed, and fetched on demand for diffs, deltas, and reverts. The feature is currently available in pre-release form and the author has not tested it; but based on the Subversion mailing list, it appears to be in working order. The author is excited to see this release, and mentions it to clarify that Subversion is still being developed and improved.

When the author handed over their contact information to access Perforce's downloads, they received an email offering setup help. The author was in contact with Perforce during the writing of this article, and asked questions about Perforce's marketing claims and some of the performance issues noted here. The author asked for permission to publish some of Perforce's responses, but Perforce did not consent to this.

Conclusion

With the exception of committing large numbers of small files in a single commit, these tests have shown that Subversion is faster, more efficient, and appears to be more scalable than Helix Core. Subversion is also Free and Open Source, developed and maintained by the community under the Apache umbrella; whereas Helix Core is closed-source, proprietary software.

The feature set of Subversion, and of TortoiseSVN, has already impressed the author, who is happy to see that Subversion is still being worked on. The author's experience with Subversion has been very positive, but its usage in the community is low. The author attributes this to the rise of accessible git hosts, such as GitHub, which never existed or took off in the same way for Subversion; this allowed git to become the standard. Git works great for projects with a small overall data size, but for games with larger assets it becomes a problem, and in such cases Subversion seems to work better on the whole.

The author has found Subversion to be quite easy to use, and it seems simpler than Helix Core. Helix Core can be misleading and unhelpful - for example, when attempting to run commands with the p4 command-line utility, the author often received the error message "Password must be set before access can be granted", even when simply trying to check the status of the local working copy with "p4 status" (and when this does work, it reads the entire working copy from disk to figure out the status). The author tried the login command, and tried setting P4PASSWD. The error message was misleading - the actual issue was that P4USER was incorrect; the password was entirely unrelated. When the author deleted a local working directory to remove files that were no longer needed, Perforce's GUI, P4V, then refused to open - just because this folder was not deleted through their system. To resolve this, the author had to execute "p4 clean" and try to open P4V a few more times.
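
For reference, one way to correct the user name on Windows is with "p4 set", which stores the setting for future sessions - the user name below is a placeholder:

    p4 set P4USER=alice
    p4 login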

By contrast, when the author accidentally cancelled an operation with TortoiseSVN, the working copy was left in an invalid state. TortoiseSVN asked the author to run the clean-up command, which failed - so it asked to run the command again, with the option to break write locks - and it worked. TortoiseSVN led the author down the correct path for the issue that was encountered, and it was easily resolved. Any Subversion working copy can be removed through File Explorer without issue, as expected. The author also likes that it works well with ssh.

The author's experience is that the Subversion tooling is simple, helpful, and yet powerful - and they have found Helix Core to be less pleasant to use. Based on the information in this article, the author has determined Subversion to be a better technology for their usage.

If the reader would like to get started with Subversion, the author recommends downloading TortoiseSVN, which optionally installs the command-line utilities. The manual is very insightful: https://tortoisesvn.net/docs/release/TortoiseSVN_en/index.html
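
For readers who prefer the command line, a minimal getting-started sketch looks something like the following - the repository path, URL, and commit message are placeholders:

    # On the server: create an empty repository.
    svnadmin create /srv/svn/myproject

    # On the client: check out a working copy, add files, and commit.
    svn checkout svn+ssh://user@server/srv/svn/myproject myproject
    cd myproject
    svn add --force .
    svn commit -m "Initial commit"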
