This change allows the introducer to become the only goroutine to
modify the root, which in turn allows the introducer to greatly reduce
its root lock holding surface area.
Two cases in this commit...
If we're shutting down, the merger might not have handed off its
latest merged segment to the introducer yet, so the merger still owns
the segment and needs to Close() that segment itself.
In persistSnapshot(), there might be cases where the persister
cannot swap in its newly persisted segments -- so, persistSnapshot()
needs to Close() those segments itself.
Some codepaths in persistSnapshot() were saving errors into an err2
local variable, which could incorrectly lead to a commit during an
error situation rather than an abort.
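With the introducer as the sole root mutator, the critical section
can shrink to roughly a pointer swap. A minimal sketch, with
simplified names standing in for the actual scorch code:

    package sketch

    import "sync"

    type IndexSnapshot struct{ refs int64 }

    func (i *IndexSnapshot) DecRef() error { return nil } // stub

    type Scorch struct {
        rootLock sync.RWMutex
        root     *IndexSnapshot
    }

    // the introducer is the only goroutine that replaces root, so
    // the lock only needs to cover the pointer swap itself
    func (s *Scorch) swapRoot(next *IndexSnapshot) {
        s.rootLock.Lock()
        prev := s.root
        s.root = next
        s.rootLock.Unlock()
        if prev != nil {
            _ = prev.DecRef() // release the old root outside the lock
        }
    }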
The slow merger was lagging behind the fast persister, leaving the
merger stuck on a persister-notify channel send while the persister
was awaiting new introductions from the introducer, totally blocking
the merger.
This fix, along with the deleted-files eligibility flipping, brings
the file count down to around 6 to 11 files per shard for both the
travel and beer samples.
The zap SegmentBase struct is a refactoring of the zap Segment into
the subset of fields that are needed for read-only ops, without any
persistence-related info. This allows scorch's in-memory segments to
use zap's optimized data encoding.
The zap Segment struct now embeds a zap SegmentBase struct, and layers
on persistence. Both the zap Segment and zap SegmentBase implement
scorch's Segment interface.
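A rough sketch of the embedding relationship, with placeholder fields
standing in for the real zap members:

    package sketch

    import "os"

    // SegmentBase holds only what read-only operations need.
    type SegmentBase struct {
        mem       []byte // zap-encoded segment data
        numDocs   uint64
        fieldsMap map[string]uint16
    }

    // Segment embeds SegmentBase and layers persistence on top.
    type Segment struct {
        SegmentBase          // read-only behavior comes from the base
        f           *os.File // backing file
        path        string
        refs        int64
    }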
The zap command-line tool has been added to the main bleve
command-line tool. This required physically relocating it, due to the
vendoring used only by the bleve command-line tool (an unforeseen
limitation). A new scorch command-line tool has also been introduced,
and for the same reasons it is physically stored under the top-level
bleve command-line tool as well.
Observed problem:
Persisted index state (in root bolt) would contain index snapshots which
pointed to index files that did not exist.
Debugging this uncovered two main problems:
1. At the end of persisting a snapshot, the persister creates a new index
snapshot with the SAME epoch as the current root, only it replaces in-memory
segments with the new disk based ones. This is problematic because
when the reference count on an index snapshot drops to 0, it is
marked "eligible for deletion", and eligible for deletion is keyed by
epoch. So having two separate instances going by the same epoch is
problematic. Specifically, one of them gets to 0 before the other,
and we wrongly conclude it's eligible for deletion, when in fact the
"other" instance with the same epoch is actually still in use.
To address this problem, we have modified the behavior of the persister. Now,
upon completion of persistence, ONLY if new files were actually created do we
proceed to introduce a new snapshot. AND, this new snapshot now gets its own
brand new epoch. BOTH of these are important because since the persister now
also introduces a new epoch, it will see this epoch again in the future AND be
expected to persist it. That is OK (mostly harmless), but we cannot allow it
to form a loop. Checking that new files were actually introduced is what
short-circuits the potential loop. The new epoch introduced by the persister,
if seen again will not have any new segments that actually need persisting to
disk, and the cycle is stopped.
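In outline, the new persister behavior might look like the following
sketch; persistToDisk, introduce, and nextSnapshotEpoch here are
illustrative stand-ins, not the actual scorch internals:

    package sketch

    import "sync/atomic"

    type IndexSnapshot struct{ epoch uint64 }

    var nextSnapshotEpoch uint64

    func persistToDisk(s *IndexSnapshot) (newFilesCreated bool, err error) {
        return false, nil // stub for the real persistence work
    }

    func introduce(s *IndexSnapshot) {} // stub

    func persistSnapshot(s *IndexSnapshot) error {
        newFilesCreated, err := persistToDisk(s)
        if err != nil {
            return err
        }
        if !newFilesCreated {
            // nothing new reached disk, so do NOT introduce another
            // snapshot -- this check short-circuits the potential loop
            return nil
        }
        // the replacement snapshot gets a brand new epoch, so no two
        // live snapshots ever share an epoch
        next := &IndexSnapshot{
            epoch: atomic.AddUint64(&nextSnapshotEpoch, 1),
        }
        introduce(next)
        return nil
    }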
2. The implementation of NumSnapshotsToKeep, and the related code to
delete old snapshots from the root bolt, also contained problems.
Specifically, the determination of which snapshots to keep vs delete
did not consider which ones were actually persisted. So, let's say you
had set NumSnapshotsToKeep to 3: if
the introducer gets 3 snapshots ahead of the persister, what can happen is that
the three snapshots we choose to keep are all in memory. We now wrongly delete
all of the snapshots from the root bolt. But it gets worse: at this
instant, we now have files on disk that nothing in the root bolt
points to, so we
also go ahead and delete those files. Those files were still being referenced
by the in-memory snapshots. But now, even if those snapshots get
persisted to disk, they simply hold references to non-existent files.
Opening up one of these indexes
results in lost data (often everything).
To address this problem, we made a large change to the way this section of code
operates. First, we now start with a list of all epochs actually persisted in
the root bolt. Second, we set aside NumSnapshotsToKeep of these snapshots to
keep. Third, anything else in the eligibleForRemoval list will be deleted. I
suspect this code is slower and less elegant, but I think it is more correct.
Also, previously NumSnapshotsToKeep defaulted to 0; I have now
defaulted it to 1, which feels like saner out-of-the-box behavior
(though it's debatable whether the original intent was perhaps for
"extra" snapshots to keep; with the variable named as it is, 1 makes
more sense to me).
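The selection might be sketched as follows, assuming a hypothetical
helper that is handed the epochs already read out of the root bolt:

    package sketch

    import "sort"

    // epochsToRemove decides what to delete, given the epochs
    // actually persisted in the root bolt and the eligibleForRemoval
    // list.
    func epochsToRemove(persisted, eligibleForRemoval []uint64,
        numSnapshotsToKeep int) (rv []uint64) {
        // first, set aside the newest numSnapshotsToKeep persisted
        // epochs
        sort.Slice(persisted, func(i, j int) bool {
            return persisted[i] > persisted[j]
        })
        keep := map[uint64]bool{}
        for i := 0; i < numSnapshotsToKeep && i < len(persisted); i++ {
            keep[persisted[i]] = true
        }
        // anything else in the eligibleForRemoval list gets deleted
        for _, epoch := range eligibleForRemoval {
            if !keep[epoch] {
                rv = append(rv, epoch)
            }
        }
        return rv
    }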
Other minor changes included in this change:
- Location of 'nextSnapshotEpoch', 'eligibleForRemoval', and
'ineligibleForRemoval' members of Scorch struct were moved into the
paragraph with 'rootLock' to clarify that you must hold the lock to access them.
- TestBatchRaceBug260 was updated to properly Close() the index; the
missing Close() was leading to occasional test failures.
+ Adding new entries to the stats struct of scorch.
+ These stats are atomically incremented upon every segment
introduction, and upon successful persistence.
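The increments follow the usual sync/atomic pattern; a minimal
sketch, with invented counter names:

    package sketch

    import "sync/atomic"

    type Stats struct {
        TotIntroducedItems   uint64
        TotPersistedSegments uint64
    }

    type Scorch struct{ stats Stats }

    // called on each segment introduction
    func (s *Scorch) onIntroduce(numItems uint64) {
        atomic.AddUint64(&s.stats.TotIntroducedItems, numItems)
    }

    // called after each successful persistence
    func (s *Scorch) onPersist() {
        atomic.AddUint64(&s.stats.TotPersistedSegments, 1)
    }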
With the previous commit, there can be a scenario where batches that
have internal-updates-only are rapidly introduced by the app, but
persisted notifications fire only for the very last IndexSnapshot.
The persisted notifications for the in-between batches might be
missed.
The solution was to track the persisted notification channels at a
higher Scorch struct level, instead of tracking the persisted channels
at the IndexSnapshot and SegmentSnapshot levels.
Also, the persister double-check looping was simplified, which avoids
a race where an introducer might incorrectly not notify the persister.
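The double-check idiom on the persister side looks roughly like this
sketch (field and method names are approximations, not the actual
scorch code): interest is registered under the same lock used for the
work check, so no introduction can slip in between the check and the
wait.

    package sketch

    import "sync"

    type Scorch struct {
        rootLock           sync.Mutex
        rootEpoch          uint64
        lastPersistedEpoch uint64
        watcherChan        chan struct{}
    }

    func (s *Scorch) persistCurrentRoot() { // stub for the real work
        s.lastPersistedEpoch = s.rootEpoch
    }

    func (s *Scorch) persisterLoop() {
        for {
            s.rootLock.Lock()
            newWork := s.rootEpoch > s.lastPersistedEpoch
            var ch chan struct{}
            if !newWork {
                // register interest while still holding the lock, so
                // an introduction cannot sneak in between the check
                // and the wait below
                ch = make(chan struct{})
                s.watcherChan = ch
            }
            s.rootLock.Unlock()
            if newWork {
                s.persistCurrentRoot()
                continue
            }
            <-ch // the introducer closes watcherChan on introduction
        }
    }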
This commit improves handling when an incoming batch has internal-data
updates only and no doc updates. In this case, a nil segment instead
of an empty segment instance is used in the segmentIntroduction. The
segmentIntroduction, that is, might now hold internal-data updates
only.
To handle synchronous persistence, a new field that's a slice of
persisted notification channels is added to the IndexSnapshot struct,
which the persister goroutine will close as each IndexSnapshot is
persisted.
Also, as part of this change, instead of checking the unsafeBatch flag
in several places, we instead check for the non-nilness of these
persisted chans.
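A condensed sketch of the notification mechanics, with approximate
field names:

    package sketch

    type IndexSnapshot struct {
        persisted []chan error // closed once this snapshot is on disk
    }

    // the persister closes each channel as the snapshot reaches
    // disk, unblocking callers that asked for synchronous persistence
    func notifyPersisted(snapshot *IndexSnapshot) {
        for _, ch := range snapshot.persisted {
            close(ch)
        }
        snapshot.persisted = nil
    }

A caller that needs synchronous persistence appends its own channel
to the snapshot's slice (under the appropriate lock) and then blocks
reading from it until the persister closes it.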
A race & solution found by Marty Schoch... consider a case when the
merger might grab a nextSegmentID, like 4, but take a while to
complete. Meanwhile, the persister grabs the nextSegmentID of 5, but
finishes its persistence work fast, and then loops to cleanup any old
files. The simple approach of checking a "highest segment ID" of 5 is
wrong now, because the deleter now thinks that segment 4's zap file is
(incorrectly) ok to delete.
The solution in this commit is to track an ephemeral map of filenames
which are ineligibleForRemoval, because they're still being written
(by the merger) and haven't been fully incorporated into the rootBolt
yet.
The merger adds to that ineligibleForRemoval map as it starts a merged
zap file, the persister cleans up entries from that map when it
persists zap filenames into the rootBolt, and the deleter (part of the
persister's loop) consults the map before performing any actual zap
file deletions.
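In sketch form, assuming illustrative method names, the three
participants coordinate through one map guarded by the root lock:

    package sketch

    import "sync"

    type Scorch struct {
        rootLock             sync.Mutex
        ineligibleForRemoval map[string]bool // filename -> in-flight
    }

    func NewScorch() *Scorch {
        return &Scorch{ineligibleForRemoval: map[string]bool{}}
    }

    // merger: called when it starts writing a new merged zap file
    func (s *Scorch) markIneligibleForRemoval(filename string) {
        s.rootLock.Lock()
        s.ineligibleForRemoval[filename] = true
        s.rootLock.Unlock()
    }

    // persister: called once the filename is recorded in the rootBolt
    func (s *Scorch) unmarkIneligibleForRemoval(filename string) {
        s.rootLock.Lock()
        delete(s.ineligibleForRemoval, filename)
        s.rootLock.Unlock()
    }

    // deleter: consults the map before removing any zap file
    func (s *Scorch) canRemove(filename string) bool {
        s.rootLock.Lock()
        defer s.rootLock.Unlock()
        return !s.ineligibleForRemoval[filename]
    }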
A new global variable, NumSnapshotsToKeep, represents the default
number of old snapshots that each scorch instance should maintain -- 0
is the default. Apps that need rollback'ability may want to increase
this value in early initialization.
The Scorch.eligibleForRemoval field tracks epochs which are safe to
delete from the rootBolt. The eligibleForRemoval is appended to
whenever the ref-count on an IndexSnapshot drops to 0.
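The ref-counting hook is roughly as follows (a simplified,
hypothetical rendering of the mechanism):

    package sketch

    import "sync"

    type Scorch struct {
        rootLock           sync.Mutex
        eligibleForRemoval []uint64
    }

    type IndexSnapshot struct {
        parent *Scorch
        epoch  uint64
        m      sync.Mutex
        refs   int64
    }

    func (i *IndexSnapshot) DecRef() {
        i.m.Lock()
        i.refs--
        done := i.refs == 0
        i.m.Unlock()
        if done {
            // this epoch's snapshot is no longer in use anywhere, so
            // it becomes safe to delete from the rootBolt
            i.parent.rootLock.Lock()
            i.parent.eligibleForRemoval =
                append(i.parent.eligibleForRemoval, i.epoch)
            i.parent.rootLock.Unlock()
        }
    }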
On startup, eligibleForRemoval is also initialized with any older
epochs found in the rootBolt.
The newly introduced Scorch.removeOldSnapshots() method is called on
every cycle of the persisterLoop(), where it trims the
eligibleForRemoval slice to stay under the size defined by
NumSnapshotsToKeep.
A future commit will remove actual storage files in order to match the
"source of truth" information found in the rootBolt.