Data generator with multiple file output #105

@noproblem666

We have a data generator for a KMeans benchmark and want to use it with the PEEL framework.
The generator runs as a Flink job and produces two files, points and centers. We want to write these files to <hdfs-root-directory>/kmeans using the GeneratedDataSet class and then read them from the KMeans Flink job.

My question is: how can we configure PEEL to create the kmeans directory in HDFS and then copy the files into it? Our current configuration, shown below, does not work.

    <!--************************************************************************
    * Data Generators
    *************************************************************************-->

    <bean id="datagen.kmeans" class="org.peelframework.flink.beans.job.FlinkJob">
        <constructor-arg name="runner" ref="flink-1.0.3"/>
        <constructor-arg name="command">
            <value><![CDATA[
              -v -c org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
              ${app.path.datagens}/KMeans.jar \
              --points ${datagen.points} \
              --k ${datagen.k} \
              --output ${system.hadoop-2.path.input}/kmeans
            ]]>
            </value>
        </constructor-arg>
    </bean>

    <!--************************************************************************
    * Data Sets
    *************************************************************************-->

    <bean id="dataset.kmeans.generated" class="org.peelframework.core.beans.data.GeneratedDataSet">
        <constructor-arg name="src" ref="datagen.kmeans"/>
        <constructor-arg name="dst" value="${system.hadoop-2.path.input}/kmeans"/>
        <constructor-arg name="fs" ref="hdfs-2.7.1"/>
    </bean>
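
For reference, the consuming side would look roughly like the sketch below. It only illustrates the layout we are aiming for: the bean id `job.kmeans`, the main class `org.example.KMeans`, the jar name `KMeansJob.jar`, the `${app.path.apps}` key, and the `--points` / `--centroids` flags are placeholders, not our actual job. It also assumes the generator writes files named `points` and `centers` directly under the `kmeans` directory.

```xml
<!-- Hypothetical consumer job: class, jar, and flag names are placeholders. -->
<bean id="job.kmeans" class="org.peelframework.flink.beans.job.FlinkJob">
    <constructor-arg name="runner" ref="flink-1.0.3"/>
    <constructor-arg name="command">
        <value><![CDATA[
          -v -c org.example.KMeans \
          ${app.path.apps}/KMeansJob.jar \
          --points ${system.hadoop-2.path.input}/kmeans/points \
          --centroids ${system.hadoop-2.path.input}/kmeans/centers
        ]]>
        </value>
    </constructor-arg>
</bean>
```

The point is that both input paths hang off the single `dst` directory of the generated data set, so the data set itself would still be one PEEL bean even though it contains two files.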

Our data generator is used in much the same way as the WordGenerator, except that it produces two files instead of one.

Do you have an idea how we could solve this with PEEL, or do we have to adjust our data generator?

Thanks!
