d6tstack package¶
Submodules¶
d6tstack.combine_csv module¶
-
class
d6tstack.combine_csv.
CombinerCSV
(fname_list, sep=', ', nrows_preview=3, chunksize=1000000.0, read_csv_params=None, columns_select=None, columns_select_common=False, columns_rename=None, add_filename=True, apply_after_read=None, log=True, logger=None)[source]¶ Bases:
object
Core combiner class. Sniffs columns, generates preview, combines aka stacks to various output formats.
Parameters: - fname_list (list) – file names, eg [‘a.csv’,’b.csv’]
- sep (string) – CSV delimiter, see pandas.read_csv()
- has_header (boolean) – data has header row
- nrows_preview (int) – number of rows in preview
- chunksize (int) – number of rows to read into memory while processing, see pandas.read_csv()
- read_csv_params (dict) – additional parameters to pass to pandas.read_csv()
- columns_select (list) – list of column names to keep
- columns_select_common (bool) – keep only common columns. Use this instead of columns_select
- columns_rename (dict) – dict of columns to rename `{‘name_old’:’name_new’}
- add_filename (bool) – add filename column to output data frame. If False, will not add column.
- apply_after_read (function) – function to apply after reading each file. needs to return a dataframe
- log (bool) – send logs to logger
- logger (object) – logger object with send_log()
-
columns
(*args, **kwargs)¶
-
columns_common
(*args, **kwargs)¶
-
columns_unique
(*args, **kwargs)¶
-
combine_preview
(*args, **kwargs)¶
-
get_sniff_results
(*args, **kwargs)¶
-
head
(*args, **kwargs)¶
-
is_all_equal
(*args, **kwargs)¶
-
is_column_present
(*args, **kwargs)¶
-
is_column_present_common
(*args, **kwargs)¶
-
is_column_present_unique
(*args, **kwargs)¶
-
preview_rename
(*args, **kwargs)¶
-
preview_select
(*args, **kwargs)¶
-
sniff_columns
(*args, **kwargs)¶
-
to_csv_align
(*args, **kwargs)¶
-
to_csv_combine
(*args, **kwargs)¶
-
to_csv_head
(*args, **kwargs)¶
-
to_mssql_combine
(*args, **kwargs)¶
-
to_mysql_combine
(*args, **kwargs)¶
-
to_pandas
(*args, **kwargs)¶
-
to_parquet_align
(*args, **kwargs)¶
-
to_parquet_combine
(*args, **kwargs)¶
-
to_psql_combine
(*args, **kwargs)¶
-
to_sql_combine
(*args, **kwargs)¶
d6tstack.convert_xls module¶
-
class
d6tstack.convert_xls.
XLSSniffer
(fname_list, logger=None)[source]¶ Bases:
object
Extracts available sheets from MULTIPLE Excel files and runs diagnostics
Parameters: - fname_list (list) – file paths, eg [‘dir/a.csv’,’dir/b.csv’]
- logger (object) – logger object with send_log(), optional
-
all_contain_sheetname
(*args, **kwargs)¶
-
all_have_idx
(*args, **kwargs)¶
-
all_same_count
(*args, **kwargs)¶
-
all_same_names
(*args, **kwargs)¶
-
sniff
(*args, **kwargs)¶
-
class
d6tstack.convert_xls.
XLStoBase
(if_exists='skip', output_dir=None, logger=None)[source]¶ Bases:
object
-
convert_single
(*args, **kwargs)¶
-
-
class
d6tstack.convert_xls.
XLStoCSVMultiFile
(fname_list, cfg_xls_sheets_sel_mode='idx_global', cfg_xls_sheets_sel=0, output_dir=None, if_exists='skip', logger=None)[source]¶ Bases:
d6tstack.convert_xls.XLStoBase
Converts xls|xlsx files to csv files. Selects a SINGLE SHEET from each file. To extract MULTIPLE SHEETS from a file use XLStoCSVMultiSheet
Parameters: - fname_list (list) – file paths, eg [‘dir/a.csv’,’dir/b.csv’]
- cfg_xls_sheets_sel_mode (string) –
mode to select tabs
name
: select by name, provide name for each file, can customize by filename_global
: select by name, one name for all filesidx
: select by index, provide index for each file, can customize by fileidx_global
: select by index, one index for all files
- cfg_xls_sheets_sel (dict) – values to select tabs {‘filename’:’value’}
- output_dir (str) – If present, file is saved in given directory, optional
- if_exists (str) – Possible values: skip and replace, default: skip, optional
- logger (object) – logger object with send_log(‘msg’,’status’), optional
-
convert_all
(*args, **kwargs)¶
-
set_files
(*args, **kwargs)¶
-
set_select_mode
(*args, **kwargs)¶
-
class
d6tstack.convert_xls.
XLStoCSVMultiSheet
(fname, sheet_names=None, output_dir=None, if_exists='skip', logger=None)[source]¶ Bases:
d6tstack.convert_xls.XLStoBase
Converts ALL SHEETS from a SINGLE xls|xlsx files to separate csv files
Parameters: - fname (string) – file path
- sheet_names (list) – list of int or str. If not given, will convert all sheets in the file
- output_dir (str) – If present, file is saved in given directory, optional
- if_exists (str) – Possible values: skip and replace, default: skip, optional
- logger (object) – logger object with send_log(‘msg’,’status’), optional
-
convert_all
(*args, **kwargs)¶
-
convert_single
(*args, **kwargs)¶
-
d6tstack.convert_xls.
read_excel_advanced
(fname, remove_blank_cols=True, remove_blank_rows=True, collapse_header=True, header_xls_range=None, header_xls_start=None, header_xls_end=None, is_preview=False, nrows_preview=3, **kwds)[source]¶ Read Excel files to pandas dataframe with advanced options like set header ranges and remove blank columns and rows
Parameters: - fname (str) – Excel file path
- remove_blank_cols (bool) – remove blank columns
- remove_blank_rows (bool) – remove blank rows
- collapse_header (bool) – to convert multiline header to a single line string
- header_xls_range (string) – range of headers in excel, eg: A4:B16
- header_xls_start (string) – Starting cell of excel for header range, eg: A4
- header_xls_end (string) – End cell of excel for header range, eg: B16
- is_preview (bool) – Read only first nrows_preview lines
- nrows_preview (integer) – Initial number of rows to be used for preview columns (default: 3)
- kwds (mixed) – parameters for pandas.read_excel() to pass through
Returns: pandas dataframe
Return type: df (dataframe)
Note
You can pass in any pandas.read_excel() parameters in particular sheet_name
d6tstack.helpers module¶
Module with several helper functions
-
d6tstack.helpers.
columns_all_equal
(col_list)[source]¶ Checks that all lists in col_list are equal.
Parameters: col_list (list) – columns, eg [[‘a’,’b’],[‘a’,’b’,’c’]] Returns: all lists in list are equal? Return type: bool
-
d6tstack.helpers.
file_extensions_all_equal
(ext_list)[source]¶ Checks that all file extensions are equal.
Parameters: ext_list (list) – file extensions, eg [‘.csv’,’.csv’] Returns: all extensions are equal to first extension in list? Return type: bool
-
d6tstack.helpers.
file_extensions_get
(fname_list)[source]¶ Returns file extensions in list
Parameters: fname_list (list) – file names, eg [‘a.csv’,’b.csv’] Returns: file extensions for each file name in input list, eg [‘.csv’,’.csv’] Return type: list
-
d6tstack.helpers.
file_extensions_valid
(ext_list)[source]¶ Checks if file list contains only valid files
Notes
Assumes all file extensions are equal! Only checks first file
Parameters: ext_list (list) – file extensions, eg [‘.csv’,’.csv’] Returns: first element in list is one of [‘.csv’,’.txt’,’.xls’,’.xlsx’]? Return type: bool
d6tstack.helpers_ui module¶
d6tstack.sniffer module¶
Finds CSV settings and Excel sheets in multiple files. Often needed as input for stacking
-
class
d6tstack.sniffer.
CSVSniffer
(fname, nlines=10, delims=', ;t|')[source]¶ Bases:
object
Automatically detects settings needed to read csv files. SINGLE file only, for MULTI file use CSVSnifferList
Parameters: - fname (string) – file path
- nlines (int) – number of lines to sample from each file
- delims (string) – possible delimiters, default “,; |”
-
check_column_length_consistent
(*args, **kwargs)¶
-
count_skiprows
(*args, **kwargs)¶
-
get_delim
(*args, **kwargs)¶
-
has_header
(*args, **kwargs)¶
-
has_header_inverse
(*args, **kwargs)¶
-
read_nlines
(*args, **kwargs)¶
-
scan_delim
(*args, **kwargs)¶
-
class
d6tstack.sniffer.
CSVSnifferList
(fname_list, nlines=10, delims=', ;t|')[source]¶ Bases:
object
Automatically detects settings needed to read csv files. MULTI file use
Parameters: - fname_list (list) – file names, eg [‘a.csv’,’b.csv’]
- nlines (int) – number of lines to sample from each file
- delims (string) – possible delimiters, default ‘,; |’
-
count_skiprows
(*args, **kwargs)¶
-
get_all
(*args, **kwargs)¶
-
get_delim
(*args, **kwargs)¶
-
has_header
(*args, **kwargs)¶