Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add memory-mapped utilility module #189

Merged
merged 4 commits into from
Dec 19, 2022

Conversation

weiliw-amz
Copy link
Contributor

@weiliw-amz weiliw-amz commented Dec 7, 2022

Issue #, if available:
N/A

Description of changes:
Add memory-mapped utilility module.

User could test with below code: Copy it into a filetest_mmap_util.cpp placed at pecos/core/util/, and run:

gcc -lm -ldl -lstdc++ -fopenmp -std=c++14 -lgcc -lgomp -O3  -I ./ test_mmap_util.cpp
./a.out

Output:

Generate a Bar with data:
---Bar---
---Foo---
foo_1: 0 1 2 3 4 5 6 7 8 9 
foo_2: 1
---------
bar: 5 5 5 5 5 
---------
Save Bar into mmap file: ./bar_test_mmap.txt
Load a new Bar from saved mmap file...
Loaded Bar data:
---Bar---
---Foo---
foo_1: 0 1 2 3 4 5 6 7 8 9 
foo_2: 1
---------
bar: 5 5 5 5 5 
---------
#include <iostream>
#include "mmap_util.hpp"

using namespace pecos::mmap_util;

// Nested class mmap example
// Bar contains a Foo instance
class Foo {
    public:
        Foo() {}
        ~Foo() { foo_1.clear(); }

        void init_data() {
            foo_1.resize(10, 0);
            for (int i=0; i<foo_1.size(); ++i) { foo_1[i] = i; }
            foo_2 = 1.0;
        }

        void print() {
            std::cout << "---Foo---" << std::endl;
            std::cout << "foo_1: ";
            for (int i=0; i<foo_1.size(); ++i) { std::cout << foo_1[i] << " "; }
            std::cout << std::endl;
            std::cout << "foo_2: " << foo_2 << std::endl;
            std::cout << "---------" << std::endl;
        }

        void save_to_mmap_store(MmapStore& mmap_s) const {
            foo_1.save_to_mmap_store(mmap_s);
            mmap_s.fput_one<double>(foo_2);
        }

        void load_from_mmap_store(MmapStore& mmap_s) {
            foo_1.load_from_mmap_store(mmap_s);
            foo_2 = mmap_s.fget_one<double>();
        }

    private:
        MmapableVector<int> foo_1;
        double foo_2;
};

class Bar {
    public:
        Bar() { }
        ~Bar() { bar.clear(); mmap_store.close(); }

        void init_data() {
            foo.init_data();
            bar.resize(5, 0);
            for (int i=0; i<bar.size(); ++i) { bar[i] = 5.0; }
        }

        void print() {
            std::cout << "---Bar---" << std::endl;
            foo.print();
            std::cout << "bar: ";
            for (int i=0; i<bar.size(); ++i) { std::cout << bar[i] << " "; }
            std::cout << std::endl;
            std::cout << "---------" << std::endl;
        }

        void save(const std::string & file_name) const {
            // Create a mmapfile for dump at the most outer layer class
            // You cannot reuse (i.e, close and reopen) mmap_store, since it may hold the data storage
            MmapStore mmap_s = MmapStore();
            mmap_s.open(file_name, "w");

            save_to_mmap_store(mmap_s);

            // Metadata dump and fp closure is automatically done at MmapStore destructor when this function ends
            // You can make it happen earlier with explicitly calling close()
            mmap_s.close();
        }
        void load(const std::string & file_name, const bool pre_load=true) {
            mmap_store.open(file_name, pre_load?"r":"r_lazy");
            load_from_mmap_store(mmap_store);
        }

        void save_to_mmap_store(MmapStore& mmap_s) const {
            foo.save_to_mmap_store(mmap_s);
            bar.save_to_mmap_store(mmap_s);
        }

        void load_from_mmap_store(MmapStore& mmap_s) {
            foo.load_from_mmap_store(mmap_s);
            bar.load_from_mmap_store(mmap_s);
        }

    private:
        Foo foo;
        MmapableVector<double> bar;
        // Mmap Data storage at the most outer layer class
        MmapStore mmap_store;
};


int main() {
    std::string f_name = "./bar_test_mmap.txt";

    std::cout << "Generate a Bar with data:" << std::endl;
    Bar bar;
    bar.init_data();
    bar.print();

    std::cout << "Save Bar into mmap file: " << f_name << std::endl;
    bar.save(f_name);

    std::cout << "Load a new Bar from saved mmap file..." << std::endl;
    Bar new_bar;
    new_bar.load(f_name, "r");

    std::cout << "Loaded Bar data:" << std::endl;
    new_bar.print();

    return 0;
}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* n_bytes_to_pad (uint64_t): Number of bytes to pad before saving the new block.
*/
uint64_t aligned_append(const uint64_t size) {
const uint64_t align_bytes = 16; // All blocks are aligned to multiple of the value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should align_bytes be declared here or in _Signature

* n_elements (uint64_t): Number of elements to load for the array.
*/
template<class T, class TT=T, if_simple_serializable<TT> = true>
T * fget_multiple(uint64_t n_elements) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have fget_next when user don't know n_elements beforehand?

_mode = _Mode::WRITEONLY;
}
else {
throw std::runtime_error("Unrecogonized mode. Should be either 'r' or 'w'.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed rp

/* Load one element */
template<class T>
T fget_one() { // Call fget_multiple
return * fget_multiple<T>(1u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no implementation of fget_multiple() when T is not simple_serializable.

Comment on lines 295 to 301
auto n_bytes_to_pad = _metadata.aligned_append(sizeof(T) * n_elements);

// Pad previous block end with dummy bytes
const char dummy = '\0';
for(auto i = 0u; i < n_bytes_to_pad; i++){
file_util::fput_one(dummy, _fp);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be implemented into alighed_append?

/* Functions to match std::vector interface */
uint64_t size() const { return _size; }

const T* data() const { return _data; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need both

const T* data() const { return _data; }
T* data() { return data; }

?

mmap_f.fput_multiple<T>(_data, _size);
}

void load_mmap(MmapFile & mmap_f) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar name as _load_mmap but different concept, consider change to from_mmap? Similarly change save_mmap to to_mmap?

}

void load_mmap(MmapFile & mmap_f) {
clear(); // Clean any previous storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to always clear or throw an error if non empty?

@weiliw-amz
Copy link
Contributor Author

Addressed all comments in newest commit.


bool empty() const { return static_cast<bool> (size_ == 0); }

void resize(uint64_t size, const T& value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a default value in the signature so it can handle both cases.
void resize(uint64_t size, const T& value=T() )

template<class T, class TT = T, details_::if_simple_serializable<TT> = true>
class MmapableVector {
public:
MmapableVector() {} // Dummy constructor for empty case
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this class will be used as a drop-in replacement of std::vector with limited operations. Do we want to consider to follow the rule of five pattern for this class? https://en.cppreference.com/w/cpp/language/rule_of_three

/*
* Module Private Class to store signature and metadata of all data blocks in mainbody for a memory-mapped file.
*/
class MmapMetadata_ {
Copy link
Contributor

@rofuyu rofuyu Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to have a trailing _ for private members (variables, or methods) but not for the class definition given that all the classes are already wrapped in a privately hinted details_ namespace.

* Return:
* n_bytes_to_pad (uint64_t): Number of bytes to pad before saving the new block.
*/
uint64_t get_n_bytes_to_pad(const uint64_t size) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not dependency on the input argument "size" in the entire implementation. Do we need this argument at all? if not, we can drop it and potentially rename the method to something like get_num_of_padding_to_align(const uint64_t aligned_bytes=N_ALIGN_BYTES_)

/* Save metadata to end of file after all data blocks have been saveed.
* Also, file pointer is closed at the end.
*/
void finalize_() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't plan to reveal this method and only use it once in the destructor, I would suggest to move the content into the destructor itself.

}

/* Free memory-mapped region */
void free_mmap_() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't plan to reveal this method and only use it once in the destructor, I would suggest to move the content into the destructor itself.

Comment on lines 395 to 396
}
else if (mode_ == Mode_::READONLY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's merge these two lines into one } else if {, which is used as a convection in other C++ codes in the library.

@weiliw-amz
Copy link
Contributor Author

Addressed all comments in newest commit.

uint8_t loaded_magic[sizeof(MAGIC)];
file_util::fget_multiple<uint8_t>(&loaded_magic[0], sizeof(MAGIC), fp);
if(std::memcmp(&MAGIC[0], &loaded_magic[0], sizeof(MAGIC)) != 0) {
throw std::runtime_error("File is not a valid PECOS MMAP file");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing period of the sentence, it's super minor though.

* Once loaded as mmap view, it cannot go back to std::vector case unless clear() is called.
*/
template<class T, class TT = T, details_::if_simple_serializable<TT> = true>
class MmapableVector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to also implement push_back and emplace_back at the moment. The vectorizer and sparse matrix currently have struct methods with vectors using them.

@weiliw-amz
Copy link
Contributor Author

Addressed all comments in newest update.

@weiliw-amz weiliw-amz merged commit 1175b33 into amzn:mainline Dec 19, 2022
@weiliw-amz weiliw-amz deleted the memory-mapped-utility-module branch December 19, 2022 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants