How does git retrieve commits associated with a file?

Question

I am writing a simple parser for .git/* files. I have covered almost everything: objects, refs, pack files, etc. But I have a problem. Say I have a large repository (300 MB, in a pack file) and I want to find all the commits that changed the file /some/deep/inside/file. This is what I do now:

fetch the last commit
find the file in it:
- fetch the parent tree
- find the next tree inside it
- repeat recursively until I reach the file
- along the way, check the hash of each subdirectory on the path to the file. If one of them is the same as in the commit examined before, I assume the file has not been modified (since its parent dir has not changed)
then I save the hash of the file and fetch the parent commit
find the file again and check whether its hash has changed
- if so, the original commit (i.e. the one before the parent) changed the file

And I repeat this over and over until I get to the first commit.

This solution works, but it sucks. In the worst case, the first search can take up to 3 minutes (for a 300 MB pack).

Is there a way to speed it up? I tried to avoid loading such large objects into memory, but now I don't see any other way. And even then, the initial load into memory will take forever :(

Regards, and thanks for any help!

git python

liadan May 15 '10 at 21:40

1 answer

araqnid · Answer 1 · 2010-05-16T15:46:03+0000

That is the basic algorithm git uses to track changes to a specific file. This is why "git log -- some/path/to/file.txt" is a relatively slow operation compared to many other SCM systems, where it would be simple (for example, in CVS, P4, etc. each repo file is a file on the server holding that file's history).

It shouldn't take that long to evaluate, though: the amount you need to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down the path, so you can quickly eliminate commits that didn't even touch that subtree. It is rare for tree objects to be very large, just as directories on a file system rarely are (not surprisingly).
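To make the pruning concrete, here is a minimal sketch. The data model is an assumption made up for illustration: `trees` maps a tree ID to a `{name: object_id}` dict, and `commits` maps a commit ID to `(root_tree_id, first_parent_id)`. It only follows first parents, so it ignores the merge handling that git itself performs.

```python
def child(trees, tree_id, name):
    """Return the ID of entry `name` in tree `tree_id`, or None if absent."""
    entries = trees.get(tree_id)  # None when tree_id is None or is a blob
    return entries.get(name) if entries else None


def path_changed(trees, new_root, old_root, path):
    """True if `path` resolves to different objects in the two root trees."""
    new_id, old_id = new_root, old_root
    for name in path.split("/"):
        if new_id == old_id:
            # Same tree ID on both sides (or the path is missing in both):
            # nothing underneath can differ, so stop descending.
            return False
        new_id = child(trees, new_id, name)
        old_id = child(trees, old_id, name)
    return new_id != old_id


def commits_touching(commits, trees, head, path):
    """Walk first parents from `head`, collecting commits that changed `path`."""
    touched = []
    commit_id = head
    while commit_id is not None:
        root, parent = commits[commit_id]
        parent_root = commits[parent][0] if parent is not None else None
        if path_changed(trees, root, parent_root, path):
            touched.append(commit_id)
        commit_id = parent
    return touched
```

Only the tree objects along the path are ever read, and a commit that left the top-level directory of the path untouched is dismissed after a single comparison.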

Are you using the pack index? If not, you essentially have to unpack the entire pack to find this out, since trees can be at the end of a long delta chain. If you have an index, you still have to apply deltas to get your tree objects, but at least you can find them quickly. Keep a cache of applied deltas, because trees very often reuse the same or similar bases: most changes to a tree object just change 20 bytes of a previous tree object. So if to get tree T1 you have to start with object T8 and apply Td7 to get T7, T6 .... etc, it is entirely likely that these other trees T2-8 will be referenced again.
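A rough sketch of such a cache follows. The `entries` mapping (pack offset to `("base", data)` or `("delta", base_offset, delta_bytes)`) and the `DeltaResolver` class are hypothetical stand-ins for whatever your pack reader already produces. `apply_delta` follows the copy/insert opcode encoding of pack deltas as I understand it, so treat it as an illustration and check it against your own parser.

```python
from collections import OrderedDict


def apply_delta(base, delta):
    """Apply one pack-style delta (copy/insert opcodes) to `base`."""
    pos = 0

    def varint():
        nonlocal pos
        value = shift = 0
        while True:
            byte = delta[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value

    src_size, dst_size = varint(), varint()
    if src_size != len(base):
        raise ValueError("delta does not match this base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:  # copy a range out of the base object
            offset = size = 0
            for i in range(4):
                if cmd & (0x01 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            out += base[offset:offset + (size or 0x10000)]
        elif cmd:  # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    if len(out) != dst_size:
        raise ValueError("unexpected result size")
    return bytes(out)


class DeltaResolver:
    """Resolve delta chains, remembering every intermediate object (LRU)."""

    def __init__(self, entries, max_cached=1024):
        self.entries = entries      # hypothetical: offset -> entry tuple
        self.max_cached = max_cached
        self.cache = OrderedDict()  # offset -> fully resolved bytes
        self.applied = 0            # number of deltas actually applied

    def _remember(self, offset, data):
        self.cache[offset] = data
        self.cache.move_to_end(offset)
        while len(self.cache) > self.max_cached:
            self.cache.popitem(last=False)

    def get(self, offset):
        chain = []  # deltas still to apply, outermost first
        cur = offset
        while cur not in self.cache:
            entry = self.entries[cur]
            if entry[0] == "base":
                self._remember(cur, entry[1])
                break
            chain.append((cur, entry[2]))
            cur = entry[1]
        data = self.cache[cur]
        for off, delta in reversed(chain):
            data = apply_delta(data, delta)
            self.applied += 1
            self._remember(off, data)
        return data
```

With this in place, resolving T1 leaves T2 to T8 in the cache, so a later request for any of them costs a dictionary lookup instead of replaying the chain. The cache size of 1024 entries is an arbitrary default, not a tuned value.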
