Background of the Article#
Etcd is an open-source distributed key-value storage system primarily used for shared configuration and service discovery. In modern distributed service architectures, etcd plays a crucial role as it provides a reliable, consistent, and efficient storage solution for maintaining cluster state and configuration information.
Why Use Etcd as a Configuration Center:#
- Consistency and Reliability: Etcd uses the Raft consensus algorithm to ensure that configuration data remains consistent and reliable even in the event of node failures in a distributed environment. This is critical for systems that require high availability.
- Simplicity: Etcd offers a simple API that supports gRPC and HTTP, making it easy for clients to interact with the storage system. This reduces the complexity of developing and maintaining distributed systems.
- High Performance: Etcd is designed to handle a large number of concurrent requests and can quickly respond to client read and write operations, which is very important for distributed services that require real-time configuration updates.
- Service Discovery: In addition to configuration management, etcd also supports service discovery, allowing service instances to register themselves upon startup and deregister upon shutdown. This way, other services can discover and connect to these service instances by querying etcd.
- Version Control: Etcd provides a version control mechanism, meaning that every change to the configuration is recorded and tracked. This is useful for auditing and rolling back configuration changes.
- Cross-Language Client Libraries: The etcd community provides client libraries in multiple languages, allowing developers to interact with etcd using the programming languages they are familiar with.
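As a concrete illustration of the points above, the following sketch uses the official Go client (go.etcd.io/etcd/client/v3) to publish, read, and watch a configuration key; the endpoint and key names are assumptions for the example, not part of the original article:
package main
import (
	"context"
	"fmt"
	"time"
	clientv3 "go.etcd.io/etcd/client/v3"
)
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// Publish a configuration value.
	if _, err := cli.Put(ctx, "/config/app/feature_flag", "on"); err != nil {
		panic(err)
	}
	// Read it back; the response carries the revision, useful for auditing and rollback.
	resp, err := cli.Get(ctx, "/config/app/feature_flag")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s (revision %d)\n", kv.Key, kv.Value, resp.Header.Revision)
	}
	// Watch for subsequent changes; this loop blocks until the key is updated again.
	for wresp := range cli.Watch(context.Background(), "/config/app/feature_flag") {
		for _, ev := range wresp.Events {
			fmt.Printf("config changed: %s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}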
Importance of Etcd in Distributed Services:#
- Cluster Coordination: In distributed systems, coordination between services is crucial. Etcd can store cluster state information, helping each service instance understand the overall state of the cluster.
- Dynamic Configuration: In a dynamically changing distributed environment, service configurations may need to be updated frequently. Etcd provides centralized configuration storage, allowing updates to be quickly and consistently distributed to all relevant services.
- Fault Recovery: In the event of node failures, etcd ensures that configuration data is not lost and can be quickly recovered, which is vital for the stability and reliability of the system.
- Scalability: As services scale, the complexity of managing configurations increases. The design of etcd allows for horizontal scaling, enabling the addition of more nodes as services grow to maintain performance.
Research Value:#
- Technical Depth: Studying etcd can help teams gain a deeper understanding of core concepts in distributed systems, such as consistency, fault tolerance, and data replication.
- Practical Application: By researching etcd, teams can better design and implement their own distributed services, improving system reliability and maintainability.
- Community Contribution: Engaging in the research and development of etcd can contribute to the open-source community while also gaining support and best practices from the community.
- Industry Trends: Understanding and mastering key technologies like etcd helps teams keep up with industry trends in cloud computing and microservices architecture.
In summary, etcd not only provides robust support for distributed services at a technical level but also holds significant practical application and research value. For teams conducting technical research on configuration center technologies, in-depth study of etcd is a worthwhile investment.
Core Concepts#
Bolt in Relation to Etcd#
Etcd is a strongly consistent distributed coordination service implemented in Go, relying on the Raft consensus algorithm for consistency and high availability. The project is open source at https://github.com/etcd-io/etcd.
This article will explore the working principles of etcd's storage-layer engine, BoltDB. In etcd's architectural design, BoltDB occupies the storage layer. BoltDB is a single-node key-value storage system written in Go that persists data directly to disk and has the following characteristics:
- Single-node Operation: BoltDB focuses on single-node environments and does not need to handle complex distributed consensus mechanisms, making its implementation simpler.
- Persistent Storage: Data is stored on disk in key-value pairs, ensuring data persistence and reliability.
- Local I/O: When performing data read and write operations, BoltDB interacts directly with the local file system, eliminating communication overhead between the client and server, making the data access process both direct and efficient.
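The characteristics above are easiest to see through BoltDB's public API. Below is a minimal sketch of embedding it as a local, single-node KV store, assuming the etcd-maintained fork go.etcd.io/bbolt; the file and bucket names are illustrative:
package main
import (
	"fmt"
	"log"
	bolt "go.etcd.io/bbolt"
)
func main() {
	// Open (or create) the database file on the local file system.
	db, err := bolt.Open("app.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Write inside a read-write transaction; data is persisted to disk on commit.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("config"))
		if err != nil {
			return err
		}
		return b.Put([]byte("feature_flag"), []byte("on"))
	}); err != nil {
		log.Fatal(err)
	}
	// Read inside a read-only transaction; I/O goes straight to the local file.
	if err := db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("config")).Get([]byte("feature_flag"))
		fmt.Printf("feature_flag = %s\n", v)
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}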
Storage Design#
In this article, we will explore the storage technology implementation of BoltDB from a macro perspective. Due to space limitations, we will provide a brief overview of BoltDB's internal mechanisms to offer a high-level understanding.
Overview of BoltDB Storage Technology Implementation:#
- Data Model: BoltDB uses key-value pairs (KV) as its core data model. This model is simple, intuitive, and easy to understand and operate.
- Transaction Support: BoltDB provides transaction support, allowing for atomic read and write operations. This means that all changes in a transaction either succeed or fail, ensuring data consistency.
- Storage Structure: Data is stored on disk in the form of B-trees (B+ trees), which are suitable for fast lookup, insertion, and deletion operations while maintaining efficient disk utilization.
- Write Optimization: Within a read-write transaction, BoltDB accumulates changes in memory as dirty pages and writes them out in a batch at commit time, reducing the number of disk I/O operations.
- Read Performance: BoltDB's read operations directly map to the B-tree structure on disk, making read operations very fast, especially for range queries.
- Data Layout: BoltDB does not compress the data it stores; keys and values are written into fixed-size pages as-is. Applications that want to save space can compress values themselves before writing.
- Concurrency Control: Although BoltDB is designed for single-node use, it supports one writer alongside many readers: a single read-write transaction at a time, running concurrently with any number of read-only transactions.
- Persistence and Recovery: BoltDB supports data persistence, allowing data to be recovered even after a system crash. It achieves this through copy-on-write pages, syncing to disk on commit, and two alternating meta pages, so it can always fall back to the last fully committed state.
- Version Metadata: The meta page records a format version and the current transaction ID, which BoltDB validates when opening a database file to maintain compatibility across upgrades.
- Security: Although BoltDB does not provide encryption features itself, it allows users to encrypt data before storing it to enhance data security.
The discussion in this article aims to provide a macro understanding of BoltDB storage technology, offering readers a framework to better understand its application and value in etcd. For readers wishing to delve deeper into its internal implementation, it is recommended to consult the official documentation and source code of BoltDB.
Read and Write#
BoltDB storage relies on disk, and the interaction for storing data is divided into read and write processes:
- In the read process: Based on mmap (memory-mapping) technology, it hides the details of disk interaction, allowing users to read the contents of disk files as if accessing a byte array in memory.
- In the write process: Based on pwrite + fdatasync, it implements data persistence while balancing efficiency and stability.
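To make these two paths concrete, here is a small conceptual sketch (not BoltDB source code) for Unix-like systems: reads go through a read-only mmap of the file, while writes use a positional WriteAt (pwrite) followed by Sync (fsync/fdatasync) to persist data.
package main
import (
	"fmt"
	"os"
	"syscall"
)
func main() {
	f, err := os.OpenFile("demo.db", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// Write path: positional write (pwrite), then flush to stable storage.
	if _, err := f.WriteAt([]byte("hello"), 0); err != nil {
		panic(err)
	}
	if err := f.Sync(); err != nil {
		panic(err)
	}
	// Read path: map the file into memory and read it like a plain byte slice.
	data, err := syscall.Mmap(int(f.Fd()), 0, 5, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)
	fmt.Println(string(data)) // prints "hello"
}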
Storage Structure of BoltDB#
When storing data, BoltDB adopts a "page" concept similar to memory and disk swapping in operating systems. This design makes data organization and management more efficient. In BoltDB, pages are divided into several types, each serving different roles:
- Meta Page: Stores BoltDB's metadata, such as the format version and checksum, along with the globally incrementing transaction ID; this is database-wide, global content.
- Freelist Page: Records which pages are free and which pages will be released by pending transactions. Much like a memory allocator's free list, it trades space for time by caching free pages for reuse, reducing interaction with the operating system.
- Branch Element Page: Stores the index entries of the B+ tree's branch (internal) nodes, which point to child pages.
- Leaf Element Page: Stores the key-value entries of the B+ tree's leaf nodes, i.e., the actual data and references to nested buckets.
When the database is initialized, BoltDB creates and persists four pages:
- Meta Page: Two meta pages are initialized and used in rotation for efficiency and stability, which is tied to the copy-on-write mechanism used by BoltDB.
- Freelist Page: Initializes a global dimension free page management block.
- Leaf Element Page: Acts as the root node of the B+ tree and is also a leaf node, providing a blank starting point for the database.
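As a reference for the page types listed above, here is a simplified sketch of the on-disk page header, loosely modeled on the structure in BoltDB's source; the field sizes and flag values are indicative rather than an authoritative copy:
package boltsketch
type pgid uint64
// page is the fixed header that precedes every on-disk page.
type page struct {
	id       pgid   // page number; byte offset in the file = id * pageSize
	flags    uint16 // page type: branch, leaf, meta, or freelist
	count    uint16 // number of elements stored in this page
	overflow uint32 // number of extra contiguous pages this page spills into
}
// Flag values distinguishing the page types.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)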
Implementation of B+ Tree#
The data storage of BoltDB is based on B+ trees, which are an optimized B-tree structure. B+ trees are multi-way balanced search trees characterized by storing all data in leaf nodes while internal nodes are used for indexing. BoltDB makes some modifications in implementing B+ trees:
- Reference Cursor: Since leaf nodes are not chained to their siblings with linked lists, BoltDB uses a cursor to assist range retrieval, recording the path it travels down the tree (see the sketch after this list).
- Adjustment Frequency: To balance operational efficiency against tree balance, BoltDB performs a single B+ tree rebalancing pass just before dirty data is flushed to disk, rather than adjusting after every operation.
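Through the public API, range retrieval is exactly this cursor walk. A minimal sketch, assuming the go.etcd.io/bbolt fork and an illustrative bucket name of "events":
package main
import (
	"bytes"
	"fmt"
	bolt "go.etcd.io/bbolt"
)
// rangeScan prints every key in the "events" bucket between min and max (inclusive).
func rangeScan(db *bolt.DB, min, max []byte) error {
	return db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("events"))
		if b == nil {
			return nil // bucket has not been created yet
		}
		c := b.Cursor()
		// Seek positions the cursor at the first key >= min; Next advances in key order,
		// with the cursor remembering its path through branch and leaf nodes.
		for k, v := c.Seek(min); k != nil && bytes.Compare(k, max) <= 0; k, v = c.Next() {
			fmt.Printf("%s: %s\n", k, v)
		}
		return nil
	})
}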
Concept of Buckets#
BoltDB introduces the concept of buckets to achieve isolation of business data. Buckets can be compared to tables in a database, but their form is more flexible, supporting nested topological relationships. Each database has a default root bucket, from which a multi-branch tree structure can be derived. Each bucket logically has an independent B+ tree for storing key-value pairs within that bucket's scope. This design makes data organization more flexible, facilitating management and expansion.
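A short sketch of this nested topology, again assuming go.etcd.io/bbolt; the "tenants" parent bucket and per-tenant child buckets are illustrative names:
package main
import bolt "go.etcd.io/bbolt"
// setTenantQuota stores a per-tenant setting under tenants/<tenant>/quota,
// creating the nested buckets on first use.
func setTenantQuota(db *bolt.DB, tenant, quota string) error {
	return db.Update(func(tx *bolt.Tx) error {
		parent, err := tx.CreateBucketIfNotExists([]byte("tenants"))
		if err != nil {
			return err
		}
		child, err := parent.CreateBucketIfNotExists([]byte(tenant))
		if err != nil {
			return err
		}
		return child.Put([]byte("quota"), []byte(quota))
	})
}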
Transaction Processing in BoltDB#
BoltDB supports two types of transactions: read-only transactions and read-write transactions. These two transaction types differ in operational permissions and concurrency handling:
- Read-Write Transactions: These transactions allow modifications to data, such as insertion, updating, and deletion. Since these operations may change the database state, read-write transactions are not idempotent. At any given time, only one read-write transaction can be executed to avoid data inconsistency caused by concurrent write operations. However, read-write transactions can be executed in parallel with multiple read-only transactions.
- Read-Only Transactions: These transactions only contain query operations and do not modify database content. Therefore, read-only transactions are idempotent and can safely execute in parallel with other read-only transactions or read-write transactions without affecting data consistency.
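The Update and View helpers shown earlier wrap this model; the explicit form uses Begin, Commit, and Rollback. A sketch, assuming go.etcd.io/bbolt, with an illustrative "accounts" bucket:
package main
import bolt "go.etcd.io/bbolt"
// transfer shows an explicit read-write transaction: Begin(true) takes the single
// writer slot, and both writes become visible and durable atomically on Commit.
func transfer(db *bolt.DB) error {
	tx, err := db.Begin(true) // read-write; Begin(false) starts a read-only transaction
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless after a successful Commit
	b, err := tx.CreateBucketIfNotExists([]byte("accounts"))
	if err != nil {
		return err
	}
	if err := b.Put([]byte("alice"), []byte("90")); err != nil {
		return err
	}
	if err := b.Put([]byte("bob"), []byte("110")); err != nil {
		return err
	}
	return tx.Commit()
}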
Overview of Core Operation Processes#
In this section, we will briefly explore the processes of several core operations in BoltDB, without delving into the details of the source code:
- Opening the Database: When opening a BoltDB database, a series of initialization operations are performed, including loading the meta page, freelist page, and root node page. These operations ensure that the database is in a consistent state and ready to accept subsequent read and write requests.
- Starting a Transaction: When initiating a transaction, BoltDB allocates the corresponding resources based on the transaction type (read-only or read-write). For read-write transactions, it ensures that no other read-write transactions are currently executing.
- Data Query: In read-only transactions, data query operations can directly access the B+ tree structure of the database to retrieve the required key-value pairs. Query operations do not lock any data, allowing for efficient parallel execution.
- Data Modification: In read-write transactions, data modification operations require more complex handling. BoltDB may create new pages to store modified data and update the B+ tree structure. Upon transaction commit, these changes are persisted to disk.
- Transaction Commit: Whether for read-only or read-write transactions, committing requires ensuring that all changes have been correctly processed. For read-write transactions, this may include updating the meta page, rebalancing the B+ tree, and persisting all newly created pages.
- Transaction Rollback: If a transaction encounters an error during execution or is explicitly rolled back, BoltDB will undo all uncommitted changes, ensuring the consistency of the database state.
Through these core operation processes, BoltDB provides an efficient and reliable key-value storage solution suitable for applications requiring high performance and data consistency.
DB Definition#
First, we introduce a core class in BoltDB—DB, which corresponds to the code abstraction of a database instance, containing the following core member attributes:
// Abstract database for boltdb
type DB struct {
// ...
// Database file name
path string
// Method to open the file
openFile func(string, int, os.FileMode) (*os.File, error)
// Database file, all data is stored here
file *os.File
// Database file content mapped based on mmap technology
data *[maxMapSize]byte
// ...
// Two meta pages used in rotation
meta0 *meta
meta1 *meta
// Size of a single database page, in bytes
pageSize int
// Whether the database has been started
opened bool
// Globally unique read-write transaction
rwtx *Tx
// A series of read-only transactions
txs []*Tx
// Freelist, managing free pages
freelist *freelist
freelistLoad sync.Once
// Object pool to improve page byte array reuse rate
pagePool sync.Pool
// ...
// Mutex to ensure global uniqueness of read-write transactions
rwlock sync.Mutex
// Mutex to protect the meta page
metalock sync.Mutex
// Read-write lock to protect mmap
mmaplock sync.RWMutex
// Operation method used for data persistence, corresponding to pwrite operation
ops struct {
writeAt func(b []byte, off int64) (n int, err error)
}
// Whether the database has been started in read-only mode
readOnly bool
}
Startup#
Main Process#
The database is started through the Open function, with the core process including:
- Using the db instance and reading various options to complete the configuration.
- Opening the corresponding database file using the provided path (if the file does not exist previously, a new file will be created).
- If creating a new database file, it is necessary to complete the initialization of 2 meta pages, 1 freelist page, and 1 leaf element page.
- Constructing the pagePool object pool for subsequent reuse of page byte arrays.
- Executing mmap operations to complete the mapping of the database file and memory space.
- Returning the constructed db instance.
func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
// Construct db instance
db := &DB{
opened: true,
}
// Enable default configuration
if options == nil {
options = DefaultOptions
}
// ...
// Default not to enable read-only mode
if options.ReadOnly {
flag = os.O_RDONLY
db.readOnly = true
} else {
// always load free pages in write mode
db.PreLoadFreelist = true
}
// Method to open the database file
db.openFile = options.OpenFile
if db.openFile == nil {
db.openFile = os.OpenFile
}
// Open the database file
var err error
if db.file, err = db.openFile(path, flag|os.O_CREATE, mode); err != nil {
_ = db.close()
return nil, err
}
// Assign the database file name
db.path = db.file.Name()
// ...
// Data persistence operation
db.ops.writeAt = db.file.WriteAt
// Size of data pages
if db.pageSize = options.PageSize; db.pageSize == 0 {
// Default equals the operating system page size
db.pageSize = defaultPageSize
}
// If creating a new db file from scratch, initialization is required
if info, err := db.file.Stat(); err != nil {
_ = db.close()
return nil, err
} else if info.Size() == 0 {
// Initialize db
if err := db.init(); err != nil {
// ...
_ = db.close()
return nil, err
}
}
// ...
// Object pool for reusing page byte arrays
db.pagePool = sync.Pool{
New: func() interface{} {
return make([]byte, db.pageSize)
},
}
// Establish mapping between the database file and memory space based on mmap
if err := db.mmap(options.InitialMmapSize); err != nil {
_ = db.close()
return nil, err
}
// Preload freelist
if db.PreLoadFreelist {
db.loadFreelist()
}
// ...
return db, nil
}
Mmap#
The mmap operation establishes a mapping between the data file and memory, with core steps including:
- Locking to ensure concurrent safety of mmap operations.
- Setting an appropriate mmap space size.
- If mmap has been executed previously, performing cleanup.
- Executing a new round of mmap operations.
func (db *DB) mmap(minsz int) (err error) {
// Mutex to protect mmap concurrency safety
db.mmaplock.Lock()
defer db.mmaplock.Unlock()
info, err := db.file.Stat()
// ...
// Adjusting appropriate mmap capacity
fileSize := int(info.Size())
var size = fileSize
if size < minsz {
size = minsz
}
size, err = db.mmapSize(size)
if err != nil {
return err
}
// ...
// If there are read-write transactions running, the bucket content needs to be reshaped due to mmap operations
if db.rwtx != nil {
db.rwtx.root.dereference()
}
// Unmap the previously established mmap mapping
if err = db.munmap(); err != nil {
return err
}
// Establish a new mmap mapping
if err = mmap(db, size); err != nil {
return err
}
// ...
return nil
}
Note: The mmap operation is implemented through system calls, and different operating systems will have different implementation details.
Creating Buckets (Tables)#
A bucket is essentially stored as a special KV pair in its parent bucket's B+ tree. Therefore, the process of creating a bucket is similar to writing ordinary KV data:
- Using a cursor, find the position of the parent bucket's B+ tree corresponding to the bucket key.
- Create a child bucket instance and obtain the serialized result.
- Insert the bucket name as the key and the serialized result of the bucket as the value into the parent bucket's B+ tree as a KV pair.
func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
// ...
// Get cursor
c := b.Cursor()
// Use the cursor to find the position corresponding to the bucket name key
k, _, flags := c.seek(key)
// Bucket already exists
if bytes.Equal(key, k) {
if (flags & bucketLeafFlag) != 0 {
return nil, ErrBucketExists
}
return nil, ErrIncompatibleValue
}
// Create a new bucket instance
var bucket = Bucket{
bucket: &bucket{},
rootNode: &node{isLeaf: true},
FillPercent: DefaultFillPercent,
}
// Obtain the serialized result of the bucket
var value = bucket.write()
// Write the KV pair data corresponding to this new bucket into the B+ tree
key = cloneBytes(key)
c.node().put(key, key, value, 0, bucketLeafFlag)
// ...
// Return the newly created bucket
return b.Bucket(key), nil
}
Retrieving Buckets (Tables)#
The process of retrieving a bucket by name is somewhat similar to querying data:
- Check the cache map of the parent bucket; if the child bucket has been deserialized, reuse it directly.
- Use a cursor to search for the KV pair of the child bucket in the parent bucket's B+ tree.
- Deserialize the KV data to generate the bucket instance.
- Add the child bucket to the cache map of the parent bucket.
- Return the retrieved child bucket.
func (b *Bucket) Bucket(name []byte) *Bucket {
// If the corresponding bucket is already cached in the map, return it directly
if b.buckets != nil {
if child := b.buckets[string(name)]; child != nil {
return child
}
}
// Use a cursor to search for the KV pair in the B+ tree
c := b.Cursor()
k, v, flags := c.seek(name)
// ...
// After finding the bucket, deserialize it
var child = b.openBucket(v)
// Cache it in the map
if b.buckets != nil {
b.buckets[string(name)] = child
}
// Return the bucket
return child
}
Data Persistence#
When BoltDB commits read-write transactions, it writes the updated dirty data to disk in one go:
- Through rebalance and spill operations, it ensures that the B+ tree's balance requirements are met.
- Executes pwrite + fdatasync operations to complete the persistence of dirty data pages.
- Returns the byte arrays backing these pages to the page pool (pagePool) for reuse.
- Since the transaction progress has been updated, the meta page also needs to be written to disk.
- Closes the read-write transaction.
func (tx *Tx) Commit() error {
// ...
// Before writing dirty data to disk, a round of B+ tree adjustment is needed to ensure its balance
// rebalance is to avoid certain nodes having too few KV pairs due to delete operations, which do not meet B+ tree balance requirements
tx.root.rebalance()
// ...
// spill is to avoid certain nodes having too many KV pairs due to put operations, which do not meet B+ tree balance requirements
if err := tx.root.spill(); err != nil {
tx.rollback()
return err
}
// Write the dirty data updated by the transaction to disk
if err := tx.write(); err != nil {
tx.rollback()
return err
}
// ...
// Write the meta page to disk
if err := tx.writeMeta(); err != nil {
tx.rollback()
return err
}
// ...
// Close the transaction
tx.close()
// ...
return nil
}
Writing dirty pages to disk is handled by tx.write:
func (tx *Tx) write() error {
// Dirty pages cached by the transaction
pages := make(pages, 0, len(tx.pages))
for _, p := range tx.pages {
pages = append(pages, p)
}
// Clear the cache
tx.pages = make(map[pgid]*page)
// Sort the dirty pages
sort.Sort(pages)
// Write the dirty pages to disk in order
for _, p := range pages {
// Total size of the page, including overflow
rem := (uint64(p.overflow) + 1) * uint64(tx.db.pageSize)
// Offset of the page, can be calculated based on page id
offset := int64(p.id) * int64(tx.db.pageSize)
var written uintptr
// Write out page in "max allocation" sized chunks.
for {
sz := rem
if sz > maxAllocSize-1 {
sz = maxAllocSize - 1
}
buf := unsafeByteSlice(unsafe.Pointer(p), written, 0, int(sz))
// Write the page to the corresponding offset in the file
if _, err := tx.db.ops.writeAt(buf, offset); err != nil {
return err
}
rem -= sz
// Finished writing in one go
if rem == 0 {
break
}
// If not finished writing in one go, continue writing in the next round
offset += int64(sz)
written += uintptr(sz)
}
}
// fdatasync operation to ensure that the data writing to disk is complete
if !tx.db.NoSync || IgnoreNoSync {
if err := fdatasync(tx.db); err != nil {
return err
}
}
// Release the pages that have been written to disk; if they do not have overflow, indicating they are standard byte arrays, clear the content and add them to the object pool for reuse
for _, p := range pages {
// Ignore page sizes over 1 page.
// These are allocated using make() instead of the page pool.
if int(p.overflow) != 0 {
continue
}
buf := unsafeByteSlice(unsafe.Pointer(p), 0, 0, tx.db.pageSize)
// See https://go.googlesource.com/go/+/f03c9202c43e0abb130669852082117ca50aa9b1
for i := range buf {
buf[i] = 0
}
tx.db.pagePool.Put(buf) //nolint:staticcheck
}
return nil
}
Conclusion#
This article provides a high-level overview of the architecture and key features of BoltDB. We explored the storage structure of BoltDB, including different types of pages (such as meta pages, freelist pages, branch element pages, and leaf element pages) and their roles in the database. Additionally, we introduced the transaction processing mechanism of BoltDB, distinguishing between read-only transactions and read-write transactions, and briefly explained their concurrency handling methods.
Furthermore, we examined the implementation of B+ trees in BoltDB from a macro perspective, including optimizations made to B-trees and adjustments in practical applications. We also mentioned the concept of buckets in BoltDB, which provides flexible isolation and organization of data.
Although this article stays at a relatively high level, it provides readers with a framework to better understand the role of BoltDB in distributed storage systems. In the future, we can delve into the specific implementation details of these core concepts and how they work together to provide efficient and reliable data storage. With a deeper understanding of BoltDB, we can further discuss its performance optimization in practical applications, fault recovery strategies, and integration with other distributed systems.