
Commit b04befa

Initial commit & code skeleton
No actual functionality yet.
1 parent c1b9615 commit b04befa

8 files changed, 250 additions and 2 deletions

ISSUE_TEMPLATE.md

+31
@@ -0,0 +1,31 @@
+[provide general introduction to the issue and why it is relevant to this repository]
+
+## Context of the issue
+
+[provide more detailed introduction to the issue itself and why it is relevant]
+
+[the remaining entries are only necessary if you are reporting a bug]
+
+## Process to reproduce the issue
+
+[list, in order, the steps needed to find and reproduce the issue; see the example below]
+
+1. User creates TPOT instance
+2. User calls TPOT `fit()` function with training data
+3. TPOT crashes with a `KeyError` after 5 generations
+
+## Expected result
+
+[describe what you would expect to have resulted from this process]
+
+## Current result
+
+[describe what you currently experience from this process, and thereby explain the bug]
+
+## Possible fix
+
+[not necessary, but suggest fixes or reasons for the bug]
+
+## `name of issue` screenshot
+
+[if relevant, include a screenshot]

MANIFEST.in

+2
@@ -0,0 +1,2 @@
+include README.md LICENSE
+recursive-include datacleaner *.py

PULL_REQUEST_TEMPLATE.md

+28
@@ -0,0 +1,28 @@
+## What does this PR do?
+
+
+
+## Where should the reviewer start?
+
+
+
+## How should this PR be tested?
+
+
+
+## Any background context you want to provide?
+
+
+
+## What are the relevant issues?
+
+[you can link directly to issues by entering # then the number of the issue, for example, #3 links to issue 3]
+
+## Screenshots (if appropriate)
+
+
+
+## Questions:
+
+- Do the docs need to be updated?
+- Does this PR add new (Python) dependencies?

README.md

+2 -2
@@ -4,11 +4,11 @@ A Python tool that automatically cleans data sets and readies them for analysis.
 
 ## datacleaner is not magic
 
-datacleaner works with CSV files only (with any regular delimiter, such as tabs, spaces, or commas).
+datacleaner works with data in [pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
 
 datacleaner is not magic, and it won't take an unorganized blob of text and automagically parse it out for you.
 
-What datacleaner *will* do is save you a ton of time encoding and cleaning your data once it's already in CSV format.
+What datacleaner *will* do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.
 
 ## License
 

datacleaner/__init__.py

+23
@@ -0,0 +1,23 @@
+# -*- coding: utf-8 -*-
+
+"""
+Copyright (c) 2016 Randal S. Olson
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software
+and associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial
+portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
+LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+"""
+
+from ._version import __version__
+from .datacleaner import autoclean, autoclean_cv, main

datacleaner/_version.py

+22
@@ -0,0 +1,22 @@
+# -*- coding: utf-8 -*-
+
+"""
+Copyright (c) 2016 Randal S. Olson
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software
+and associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial
+portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
+LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+"""
+
+__version__ = '0.1'

datacleaner/datacleaner.py

+93
@@ -0,0 +1,93 @@
+# -*- coding: utf-8 -*-
+
+"""
+Copyright (c) 2016 Randal S. Olson
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software
+and associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial
+portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
+LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+"""
+
+from __future__ import print_function
+import pandas as pd
+import argparse
+
+def autoclean(input_dataframe):
+    """Performs a series of automated data cleaning transformations on the provided data set
+
+    Parameters
+    ----------
+    input_dataframe: pandas.DataFrame
+        Data set to clean
+
+    Returns
+    -------
+    output_dataframe: pandas.DataFrame
+        Cleaned data set
+
+    """
+    return
+
+def autoclean_cv(training_dataframe, testing_dataframe):
+    """Performs a series of automated data cleaning transformations on the provided training and testing data sets
+
+    Unlike `autoclean()`, this function takes cross-validation into account by learning the data transformations from only the training set, then
+    applying those transformations to both the training and testing sets. By doing so, this function prevents information from the
+    testing set leaking into the transformations applied to the training set.
+
+    Parameters
+    ----------
+    training_dataframe: pandas.DataFrame
+        Training data set
+
+    testing_dataframe: pandas.DataFrame
+        Testing data set
+
+    Returns
+    -------
+    output_training_dataframe: pandas.DataFrame
+        Cleaned training data set
+
+    output_testing_dataframe: pandas.DataFrame
+        Cleaned testing data set
+
+    """
+    return
+
+def main():
+    """Main function that is called when datacleaner is run on the command line"""
+    from ._version import __version__
+
+    parser = argparse.ArgumentParser(description='A Python tool that automatically cleans data sets and readies them for analysis')
+
+    parser.add_argument('INPUT_FILENAME', type=str, help='Data file to clean')
+
+    parser.add_argument('-o', action='store', dest='OUTPUT_FILENAME', default=None,
+                        type=str, help='Data file to output to')
+
+    parser.add_argument('-is', action='store', dest='INPUT_SEPARATOR', default='\t',
+                        type=str, help='Column separator for the input file (default: \\t)')
+
+    parser.add_argument('-os', action='store', dest='OUTPUT_SEPARATOR', default='\t',
+                        type=str, help='Column separator for the output file (default: \\t)')
+
+    parser.add_argument('--version', action='version',
+                        version='datacleaner v{version}'.format(version=__version__))
+
+    args = parser.parse_args()
+
+
+
+if __name__ == '__main__':
+    main()
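The `autoclean_cv()` docstring describes learning transformations from the training set only and then applying them to both sets, but the skeleton does not implement anything yet. The sketch below illustrates that idea with one possible transformation, median imputation. It is not the author's implementation: the helper names (`learn_medians`, `fill_missing`, `autoclean_cv_sketch`) are hypothetical, and plain dicts of lists stand in for pandas DataFrames so the example has no dependencies.

```python
from statistics import median


def learn_medians(training_columns):
    """Learn one fill value per column: the median of its observed values.

    `training_columns` maps a column name to a list of numbers,
    with None marking a missing value (an assumption of this sketch).
    """
    return {
        name: median(value for value in values if value is not None)
        for name, values in training_columns.items()
    }


def fill_missing(columns, fill_values):
    """Return a copy of `columns` with every None replaced by the learned value."""
    return {
        name: [fill_values[name] if value is None else value for value in values]
        for name, values in columns.items()
    }


def autoclean_cv_sketch(training_columns, testing_columns):
    """Hypothetical stand-in for autoclean_cv(): learn on training, apply to both."""
    # The fill values are computed from the training data only, so nothing
    # about the testing data influences the transformation.
    fill_values = learn_medians(training_columns)
    return (fill_missing(training_columns, fill_values),
            fill_missing(testing_columns, fill_values))


training = {'age': [20, None, 40, 30]}
testing = {'age': [None, 90]}
clean_training, clean_testing = autoclean_cv_sketch(training, testing)
print(clean_training)
print(clean_testing)
```

The missing testing value is filled with the training median (30) rather than a value derived from the testing column, which is the property the docstring asks for. A column with no observed training values would make `median` raise `StatisticsError`; a real implementation would need to decide how to handle that case.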

setup.py

+49
@@ -0,0 +1,49 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+from setuptools import setup, find_packages
+
+def calculate_version():
+    initpy = open('datacleaner/_version.py').read().split('\n')
+    version = list(filter(lambda x: '__version__' in x, initpy))[0].split('\'')[1]
+    return version
+
+package_version = calculate_version()
+
+setup(
+    name='datacleaner',
+    version=package_version,
+    author='Randal S. Olson',
+    author_email='[email protected]',
+    packages=find_packages(),
+    url='https://github.com/rhiever/datacleaner',
+    license='License :: OSI Approved :: MIT License',
+    entry_points={'console_scripts': ['datacleaner=datacleaner:main', ]},
+    description=('A Python tool that automatically cleans data sets and readies them for analysis.'),
+    long_description='''
+A Python tool that automatically cleans data sets and readies them for analysis.
+
+Contact
+=============
+If you have any questions or comments about datacleaner, please feel free to contact me via:
+
+
+
+or Twitter: https://twitter.com/randal_olson
+
+This project is hosted at https://github.com/rhiever/datacleaner
+''',
+    zip_safe=True,
+    classifiers=[
+        'Intended Audience :: Developers',
+        'Intended Audience :: Information Technology',
+        'Intended Audience :: Science/Research',
+        'License :: OSI Approved :: MIT License',
+        'Programming Language :: Python :: 2',
+        'Programming Language :: Python :: 2.7',
+        'Programming Language :: Python :: 3',
+        'Programming Language :: Python :: 3.4',
+        'Programming Language :: Python :: 3.5',
+        'Topic :: Utilities'
+    ],
+    keywords=['data cleaning', 'csv', 'machine learning', 'data analysis', 'data engineering'],
+)
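`calculate_version()` in `setup.py` takes the first line of `datacleaner/_version.py` that contains `__version__` and splits it on single quotes. That works for the file in this commit, but it would likely break if the version were written with double quotes or if an earlier line merely mentioned `__version__`. One possible alternative, shown here only as a sketch, anchors a regular expression to the assignment itself. The function name `parse_version` is hypothetical and not part of the commit.

```python
import re


def parse_version(source_text):
    """Return the string assigned to __version__ in `source_text`.

    Raises ValueError when no such assignment is found.
    """
    # Match an assignment at the start of a line, with either quote style.
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
        source_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError('no __version__ assignment found')
    return match.group(1)


# Sample text mirroring the layout of datacleaner/_version.py.
sample = "# -*- coding: utf-8 -*-\n\n__version__ = '0.1'\n"
print(parse_version(sample))
```

In `setup.py` this could be fed the contents of `datacleaner/_version.py`, read inside a `with open(...)` block so the file handle is closed promptly.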
