d6tstack package

Submodules

d6tstack.combine_csv module

class d6tstack.combine_csv.CombinerCSV(fname_list, sep=',', nrows_preview=3, chunksize=1000000, read_csv_params=None, columns_select=None, columns_select_common=False, columns_rename=None, add_filename=True, apply_after_read=None, log=True, logger=None)[source]

Bases: object

Core combiner class. Sniffs columns, generates preview, combines aka stacks to various output formats.

Parameters:
  • fname_list (list) – file names, eg ['a.csv','b.csv']
  • sep (string) – CSV delimiter, see pandas.read_csv()
  • nrows_preview (int) – number of rows in preview
  • chunksize (int) – number of rows to read into memory while processing, see pandas.read_csv()
  • read_csv_params (dict) – additional parameters to pass to pandas.read_csv()
  • columns_select (list) – list of column names to keep
  • columns_select_common (bool) – keep only common columns. Use this instead of columns_select
  • columns_rename (dict) – dict of columns to rename {'name_old':'name_new'}
  • add_filename (bool) – add a filename column to the output dataframe
  • apply_after_read (function) – function to apply after reading each file; must return a dataframe
  • log (bool) – send logs to logger
  • logger (object) – logger object with send_log()
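The stacking behaviour these parameters configure can be illustrated with a minimal pandas-only sketch of combining files with unequal columns (file names and data are made up; this is not the CombinerCSV implementation):

```python
import io
import pandas as pd

# Two in-memory "files" with overlapping but unequal columns,
# the situation CombinerCSV is built for. Names are illustrative.
csv_a = io.StringIO("date,sales\n2020-01-01,100\n")
csv_b = io.StringIO("date,sales,cost\n2020-01-02,200,50\n")

dfs = []
for name, f in [("a.csv", csv_a), ("b.csv", csv_b)]:
    df = pd.read_csv(f)
    df["filename"] = name  # mirrors add_filename=True
    dfs.append(df)

# Stacking aligns on the union of columns; missing cells become NaN.
df_all = pd.concat(dfs, ignore_index=True, sort=False)
print(sorted(df_all.columns))  # ['cost', 'date', 'filename', 'sales']
```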
combine_preview()[source]

Preview of what the combined data will look like

Returns:combined dataframe
Return type:dataframe
get_sniff_results()[source]
is_all_equal()[source]

Checks if all columns are equal in all files

Returns:all columns are equal in all files?
Return type:bool
is_column_present()[source]

Shows which columns are present in which files

Returns:boolean values for column presence in each file
Return type:dataframe
is_column_present_common()[source]

Shows common columns by file

Returns:boolean values for column presence in each file
Return type:dataframe
is_column_present_unique()[source]

Shows unique columns by file

Returns:boolean values for column presence in each file
Return type:dataframe
preview_rename()[source]

Shows which columns will be renamed in processing

Returns:columns to be renamed from which file
Return type:dataframe
preview_select()[source]

Shows which columns will be selected in processing

Returns:columns to be selected from all files
Return type:list
sniff_columns()[source]

Checks column consistency by reading top nrows in all files. It checks both presence and order of columns in all files

Returns:
results dictionary with
  • files_columns (dict): keys = filename, values = list of columns in that file
  • columns_all (list): all columns across all files
  • columns_common (list): only columns present in every file
  • is_all_equal (bool): do all files have identical columns?
  • df_columns_present (dataframe): which columns are present in which file
  • df_columns_order (dataframe): where in each file the column appears
Return type:dict
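The consistency checks reported in that dictionary can be sketched in plain Python (filenames and columns are illustrative, not the library's internals):

```python
# Minimal sketch of the checks sniff_columns reports.
files_columns = {
    "a.csv": ["date", "sales"],
    "b.csv": ["date", "sales", "cost"],
}

# columns_all: union of columns, preserving first-seen order
columns_all = []
for cols in files_columns.values():
    for c in cols:
        if c not in columns_all:
            columns_all.append(c)

# columns_common: only columns present in every file
columns_common = [c for c in columns_all
                  if all(c in cols for cols in files_columns.values())]

# is_all_equal: every file has exactly the same column list
first = next(iter(files_columns.values()))
is_all_equal = all(cols == first for cols in files_columns.values())

print(columns_all)      # ['date', 'sales', 'cost']
print(columns_common)   # ['date', 'sales']
print(is_all_equal)     # False
```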
to_csv_align(output_dir=None, output_prefix='d6tstack-', write_params={})[source]

Create cleaned versions of original files. Automatically runs out of core, using self.chunksize.

Parameters:
  • output_dir (str) – directory to save files in. If not given, files are saved in the same directory as the originals
  • output_prefix (str) – prepend with prefix to distinguish from original files
  • write_params (dict) – additional params to pass to pandas.to_csv()
Returns:

list of filenames of processed files

Return type:

list

to_csv_combine(filename, write_params={})[source]

Combines all files to a single csv file. Automatically runs out of core, using self.chunksize.

Parameters:
  • filename (str) – output file name
  • write_params (dict) – additional params to pass to pandas.to_csv()
Returns:

filename for combined data

Return type:

str
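The out-of-core pattern behind to_csv_combine can be sketched with pandas chunked reads; an in-memory buffer stands in for real files here, and chunksize is kept tiny for illustration (this is a sketch, not the library code):

```python
import io
import pandas as pd

# Read each input in bounded chunks and append to one output,
# so memory use stays flat regardless of total file size.
src = io.StringIO(
    "date,sales\n"
    + "\n".join(f"2020-01-{i:02d},{i * 10}" for i in range(1, 6))
)
out = io.StringIO()

first = True
for chunk in pd.read_csv(src, chunksize=2):
    # Write the header only once, with the first chunk.
    chunk.to_csv(out, index=False, header=first)
    first = False

out.seek(0)
df = pd.read_csv(out)
print(len(df))  # 5
```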

to_mssql_combine(uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv')[source]

Load all files into an MS SQL Server table, staging a temporary CSV and bulk loading it. Chunks data load to reduce memory consumption

Parameters:
  • uri (str) – mssql sqlalchemy database uri
  • table_name (str) – table to store data in
  • schema_name (str) – name of schema to write to
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • tmpfile (str) – filename for temporary file to load from
Returns:

True if loader finished

Return type:

bool

to_mysql_combine(uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=',')[source]

Load all files into a sql table using native MySQL LOAD DATA LOCAL INFILE. Chunks data load to reduce memory consumption

Parameters:
  • uri (str) – mysql mysqlconnector sqlalchemy database uri
  • table_name (str) – table to store data in
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • tmpfile (str) – filename for temporary file to load from
  • sep (str) – separator for temp file, eg ',' or ' '
Returns:

True if loader finished

Return type:

bool

to_pandas()[source]

Combine all files to a pandas dataframe

Returns:combined data
Return type:dataframe
to_parquet_align(output_dir=None, output_prefix='d6tstack-', write_params={})[source]

Same as to_csv_align but outputs parquet files

to_parquet_combine(filename, write_params={})[source]

Same as to_csv_combine but outputs parquet files

to_psql_combine(uri, table_name, if_exists='fail', sep=',')[source]

Load all files into a sql table using native postgres COPY FROM. Chunks data load to reduce memory consumption

Parameters:
  • uri (str) – postgres psycopg2 sqlalchemy database uri
  • table_name (str) – table to store data in
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • sep (str) – separator for temp file, eg ',' or ' '
Returns:

True if loader finished

Return type:

bool

to_sql_combine(uri, tablename, if_exists='fail', write_params=None, return_create_sql=False)[source]

Load all files into a sql table using sqlalchemy. Generic but slower than the optimized functions

Parameters:
  • uri (str) – sqlalchemy database uri
  • tablename (str) – table to store data in
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • write_params (dict) – additional params to pass to pandas.to_sql()
  • return_create_sql (bool) – if True, return the CREATE TABLE statement for the combined schema instead of loading the data
Returns:

True if loader finished

Return type:

bool
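The generic sqlalchemy load path can be sketched with pandas.to_sql; sqlite3 stands in for the target database, and the table name and data are illustrative:

```python
import sqlite3
import pandas as pd

# Sketch of the generic load path: hand the combined frame to
# pandas.to_sql. sqlite3 stands in for the target database here.
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"],
                   "sales": [100, 200]})

con = sqlite3.connect(":memory:")
df.to_sql("sales", con, if_exists="fail", index=False)

n = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(n)  # 2
```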

d6tstack.convert_xls module

class d6tstack.convert_xls.XLSSniffer(fname_list, logger=None)[source]

Bases: object

Extracts available sheets from MULTIPLE Excel files and runs diagnostics

Parameters:
  • fname_list (list) – file paths, eg ['dir/a.xls','dir/b.xlsx']
  • logger (object) – logger object with send_log(), optional
all_contain_sheetname(sheet_name)[source]

Check if all files contain a certain sheet

Parameters:sheet_name (string) – sheet name to check
Returns:True if all files contain the sheet
Return type:boolean
all_have_idx(sheet_idx)[source]

Check if all files contain a certain index

Parameters:sheet_idx (int) – sheet index to check
Returns:True if all files contain the sheet index
Return type:boolean
all_same_count()[source]

Check if all files contain the same number of sheets

Returns:True if all files contain the same number of sheets
Return type:boolean
all_same_names()[source]
sniff()[source]

Executes sniffer

Returns:True if everything ok. Results are accessible in .df_xls_sheets
Return type:boolean
class d6tstack.convert_xls.XLStoBase(if_exists='skip', output_dir=None, logger=None)[source]

Bases: object

convert_single(fname, sheet_name, **kwds)[source]

Converts single file

Parameters:
  • fname (str) – path to file
  • sheet_name (str) – optional sheet name to override the global cfg_xls_sheets_sel
  • kwds (mixed) – parameters for d6tstack.utils.read_excel_advanced() to pass through
Returns:

output file names

Return type:

list

class d6tstack.convert_xls.XLStoCSVMultiFile(fname_list, cfg_xls_sheets_sel_mode='idx_global', cfg_xls_sheets_sel=0, output_dir=None, if_exists='skip', logger=None)[source]

Bases: d6tstack.convert_xls.XLStoBase

Converts xls|xlsx files to csv files. Selects a SINGLE SHEET from each file. To extract MULTIPLE SHEETS from a file use XLStoCSVMultiSheet

Parameters:
  • fname_list (list) – file paths, eg ['dir/a.xls','dir/b.xlsx']
  • cfg_xls_sheets_sel_mode (string) –

    mode to select tabs

    • name: select by name, provide name for each file, can customize by file
    • name_global: select by name, one name for all files
    • idx: select by index, provide index for each file, can customize by file
    • idx_global: select by index, one index for all files
  • cfg_xls_sheets_sel (dict) – values to select tabs {'filename':'value'}
  • output_dir (str) – If present, file is saved in given directory, optional
  • if_exists (str) – Possible values: skip and replace, default: skip, optional
  • logger (object) – logger object with send_log('msg','status'), optional
convert_all(**kwds)[source]

Converts all files

Parameters:kwds (mixed) – parameters for d6tstack.utils.read_excel_advanced() to pass through
Returns:output file names
Return type:list
set_files(fname_list)[source]

Update input files. You will also need to update sheet selection with .set_select_mode().

Parameters:fname_list (list) – see class description for details
set_select_mode(cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel)[source]

Update sheet selection values

Parameters:
  • cfg_xls_sheets_sel_mode (string) – see class description for details
  • cfg_xls_sheets_sel (list) – see class description for details
class d6tstack.convert_xls.XLStoCSVMultiSheet(fname, sheet_names=None, output_dir=None, if_exists='skip', logger=None)[source]

Bases: d6tstack.convert_xls.XLStoBase

Converts ALL SHEETS from a SINGLE xls|xlsx file to separate csv files

Parameters:
  • fname (string) – file path
  • sheet_names (list) – list of int or str. If not given, will convert all sheets in the file
  • output_dir (str) – If present, file is saved in given directory, optional
  • if_exists (str) – Possible values: skip and replace, default: skip, optional
  • logger (object) – logger object with send_log('msg','status'), optional
convert_all(**kwds)[source]

Converts all files

Parameters:kwds (mixed) – parameters for d6tstack.utils.read_excel_advanced() to pass through
Returns:output file names
Return type:list
convert_single(sheet_name, **kwds)[source]

Converts a single sheet

Parameters:
  • sheet_name (str) – Excel sheet to convert
  • kwds (mixed) – parameters for d6tstack.utils.read_excel_advanced() to pass through
Returns:

output file name

Return type:

str

d6tstack.convert_xls.read_excel_advanced(fname, remove_blank_cols=True, remove_blank_rows=True, collapse_header=True, header_xls_range=None, header_xls_start=None, header_xls_end=None, is_preview=False, nrows_preview=3, **kwds)[source]

Read Excel files into a pandas dataframe with advanced options such as setting header ranges and removing blank columns and rows

Parameters:
  • fname (str) – Excel file path
  • remove_blank_cols (bool) – remove blank columns
  • remove_blank_rows (bool) – remove blank rows
  • collapse_header (bool) – to convert multiline header to a single line string
  • header_xls_range (string) – range of headers in excel, eg: A4:B16
  • header_xls_start (string) – Starting cell of excel for header range, eg: A4
  • header_xls_end (string) – End cell of excel for header range, eg: B16
  • is_preview (bool) – Read only first nrows_preview lines
  • nrows_preview (integer) – Initial number of rows to be used for preview columns (default: 3)
  • kwds (mixed) – parameters for pandas.read_excel() to pass through
Returns:

pandas dataframe

Return type:

dataframe

Note

You can pass in any pandas.read_excel() parameters in particular sheet_name
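The collapse_header option can be illustrated with a pandas-only sketch that joins a two-row (MultiIndex) header into single-line column names; the data is made up, and read_excel_advanced's actual implementation may differ:

```python
import pandas as pd

# A frame as pandas returns it for a two-row Excel header.
df = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([("2020", "sales"), ("2020", "cost")]),
)

# Collapse the multiline header into single-line strings,
# skipping empty header cells.
df.columns = [" ".join(str(p) for p in col if str(p) != "")
              for col in df.columns]
print(df.columns.tolist())  # ['2020 sales', '2020 cost']
```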

d6tstack.helpers module

Module with several helper functions

d6tstack.helpers.check_valid_xls(fname_list)[source]
d6tstack.helpers.cols_filename_tofront(_list)[source]
d6tstack.helpers.columns_all_equal(col_list)[source]

Checks that all lists in col_list are equal.

Parameters:col_list (list) – columns, eg [['a','b'],['a','b','c']]
Returns:all lists in list are equal?
Return type:bool
d6tstack.helpers.compare_pandas_versions(version1, version2)[source]
d6tstack.helpers.df_filename_tofront(dfg)[source]
d6tstack.helpers.file_extensions_all_equal(ext_list)[source]

Checks that all file extensions are equal.

Parameters:ext_list (list) – file extensions, eg ['.csv','.csv']
Returns:all extensions are equal to first extension in list?
Return type:bool
d6tstack.helpers.file_extensions_contains_csv(ext_list)[source]
d6tstack.helpers.file_extensions_contains_xls(ext_list)[source]
d6tstack.helpers.file_extensions_contains_xlsx(ext_list)[source]
d6tstack.helpers.file_extensions_get(fname_list)[source]

Returns file extensions in list

Parameters:fname_list (list) – file names, eg ['a.csv','b.csv']
Returns:file extensions for each file name in input list, eg ['.csv','.csv']
Return type:list
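A plain-Python sketch of what file_extensions_get and file_extensions_all_equal report (file names are illustrative; this is not the library code):

```python
import os

# Extensions for a list of file names, as file_extensions_get returns.
fname_list = ["a.csv", "b.csv", "c.xlsx"]
ext_list = [os.path.splitext(f)[1] for f in fname_list]
print(ext_list)  # ['.csv', '.csv', '.xlsx']

# All equal to the first extension? (file_extensions_all_equal)
all_equal = all(e == ext_list[0] for e in ext_list)
print(all_equal)  # False
```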
d6tstack.helpers.file_extensions_valid(ext_list)[source]

Checks if file list contains only valid files

Notes

Assumes all file extensions are equal! Only checks first file

Parameters:ext_list (list) – file extensions, eg ['.csv','.csv']
Returns:first element in list is one of ['.csv','.txt','.xls','.xlsx']?
Return type:bool
d6tstack.helpers.list_common(_list, sort=True)[source]
d6tstack.helpers.list_tofront(_list, val)[source]
d6tstack.helpers.list_unique(_list, sort=True)[source]

d6tstack.helpers_ui module

d6tstack.sniffer module

Finds CSV settings and Excel sheets in multiple files. Often needed as input for stacking

class d6tstack.sniffer.CSVSniffer(fname, nlines=10, delims=',;\t|')[source]

Bases: object

Automatically detects settings needed to read csv files. SINGLE file only, for MULTI file use CSVSnifferList

Parameters:
  • fname (string) – file path
  • nlines (int) – number of lines to sample from each file
  • delims (string) – possible delimiters, default ',;\t|'
check_column_length_consistent()[source]
count_skiprows()[source]
get_delim()[source]
has_header()[source]
has_header_inverse()[source]
read_nlines()[source]
scan_delim()[source]
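Delimiter and header detection on a sample of lines can be sketched with the stdlib csv.Sniffer, used here as a stand-in for CSVSniffer's scan_delim/get_delim/has_header (the sample data is made up):

```python
import csv

# A sample of the first few lines, as CSVSniffer reads with nlines.
sample = "date;sales\n2020-01-01;100\n2020-01-02;200\n"

# Detect the delimiter from a set of candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(dialect.delimiter)  # ';'

# Guess whether the first row is a header.
print(csv.Sniffer().has_header(sample))
```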
class d6tstack.sniffer.CSVSnifferList(fname_list, nlines=10, delims=',;\t|')[source]

Bases: object

Automatically detects settings needed to read csv files. MULTI file use

Parameters:
  • fname_list (list) – file names, eg ['a.csv','b.csv']
  • nlines (int) – number of lines to sample from each file
  • delims (string) – possible delimiters, default ',;\t|'
count_skiprows()[source]
get_all(fun_name, msg_error)[source]
get_delim()[source]
has_header()[source]
d6tstack.sniffer.csv_count_rows(fname)[source]
d6tstack.sniffer.sniff_settings_csv(fname_list)[source]

d6tstack.sync module

d6tstack.utils module

class d6tstack.utils.PrintLogger[source]

Bases: object

send(data)[source]
send_log(msg, status)[source]
d6tstack.utils.pd_readsql_query_from_sqlengine(uri, sql, schema_name=None, connect_args=None)[source]

Load the results of a SQL statement into a pandas dataframe using the sqlalchemy engine's execute(), which is faster than the default pandas read path.

Parameters:
  • uri (str) – postgres psycopg2 sqlalchemy database uri
  • sql (str) – sql query
  • schema_name (str) – name of schema
  • connect_args (dict) – dictionary of connection arguments to pass to sqlalchemy.create_engine
Returns:

pandas dataframe

Return type:

df

d6tstack.utils.pd_readsql_table_from_sqlengine(uri, table_name, schema_name=None, connect_args=None)[source]

Load a SQL table into a pandas dataframe using the sqlalchemy engine's execute(), which is faster than the default pandas read path. Convenience function that returns the full table.

Parameters:
  • uri (str) – postgres psycopg2 sqlalchemy database uri
  • table_name (str) – table
  • schema_name (str) – name of schema
  • connect_args (dict) – dictionary of connection arguments to pass to sqlalchemy.create_engine
Returns:

pandas dataframe

Return type:

df

d6tstack.utils.pd_to_mssql(df, uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv')[source]

Load dataframe into an MS SQL Server table, staging a temporary CSV and bulk loading it.

Parameters:
  • df (dataframe) – pandas dataframe
  • uri (str) – mssql sqlalchemy database uri
  • table_name (str) – table to store data in
  • schema_name (str) – name of schema in db to write to
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • tmpfile (str) – filename for temporary file to load from
Returns:

True if loader finished

Return type:

bool

d6tstack.utils.pd_to_mysql(df, uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=',', newline='\n')[source]

Load dataframe into a sql table using native MySQL LOAD DATA LOCAL INFILE.

Parameters:
  • df (dataframe) – pandas dataframe
  • uri (str) – mysql mysqlconnector sqlalchemy database uri
  • table_name (str) – table to store data in
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • tmpfile (str) – filename for temporary file to load from
  • sep (str) – separator for temp file, eg ',' or ' '
Returns:

True if loader finished

Return type:

bool

d6tstack.utils.pd_to_psql(df, uri, table_name, schema_name=None, if_exists='fail', sep=',')[source]

Load pandas dataframe into a sql table using native postgres COPY FROM.

Parameters:
  • df (dataframe) – pandas dataframe
  • uri (str) – postgres psycopg2 sqlalchemy database uri
  • table_name (str) – table to store data in
  • schema_name (str) – name of schema in db to write to
  • if_exists (str) – {'fail', 'replace', 'append'}, default 'fail'. See pandas.to_sql() for details
  • sep (str) – separator for temp file, eg ',' or ' '
Returns:

True if loader finished

Return type:

bool
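The fast-load pattern behind pd_to_psql can be sketched as serializing the frame to a CSV buffer that postgres COPY FROM would consume; since no database is available here, the psycopg2 copy_expert call is shown only as a comment, and the table name is illustrative:

```python
import io
import pandas as pd

# Serialize the frame to an in-memory CSV buffer, the input that
# postgres COPY FROM bulk-loads far faster than row-by-row INSERTs.
df = pd.DataFrame({"date": ["2020-01-01"], "sales": [100]})

buf = io.StringIO()
df.to_csv(buf, index=False, header=False, sep=",")
buf.seek(0)

# With a live psycopg2 cursor, the buffer would be passed to:
#   cur.copy_expert("COPY tbl FROM STDIN WITH CSV", buf)
print(buf.getvalue().strip())  # '2020-01-01,100'
```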

Module contents