I'm working on a script that migrates data from MongoDB to ClickHouse. Because nested structures aren't implemented well enough in ClickHouse, I iterate over the nested structure and flatten it, so that every element of the nested structure becomes a distinct row in the ClickHouse database.
What I do is iterate over a list of dictionaries and pick out the target values. The structure looks like this:
```python
[{'Comment': None,
  'Details': None,
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'Новый',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 39, 55, 475000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'ЕкатеринаКарпенко',
  'Stage': {'Label': 'Новые', 'Order': 0, '_id': 'newStage'},
  'Tags': None,
  'Type': 'Unknown',
  'Weight': 120,
  '_id': 'new'},
 {'Comment': None,
  'Details': {'Name': 'взятвработу', '_id': 1},
  'FunnelId': 'MegafonCompany',
  'IsHot': False,
  'IsReadonly': False,
  'Name': 'Вработе',
  'SetAt': datetime.datetime(2018, 4, 20, 10, 40, 4, 841000),
  'SetById': 'ekaterina.karpenko',
  'SetByName': 'ЕкатеринаКарпенко',
  'Stage': {'Label': 'Приглашениенаинтервью', 'Order': 1, '_id': 'recruiterStage'},
  'Tags': None,
  'Type': 'InProgress',
  'Weight': 80,
  '_id': 'phoneInterview'}]
```
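Just to show the target shape (not my actual conversion script, which is driven by a `coldict` configuration as below), the flattening itself can be sketched with pandas' `json_normalize` (`pandas.io.json.json_normalize` in older pandas versions): each element of the list becomes one row, and the nested `Stage` dict is expanded into columns.

```python
import pandas as pd

# `status_history` is the list of dicts shown above
flat = pd.json_normalize(status_history)

# One row per status-history element; nested dicts become dotted columns
# such as 'Stage.Label', 'Stage.Order', 'Stage._id'.
print(flat[['Name', 'SetAt', 'Stage.Label', 'Stage.Order', '_id']])
```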
I have a function that does this on the dataframe object via the `data.iterrows()` method:
```python
def to_flat(data, coldict, field_last_upd):
    m_status_history = stc.special_mongo_names['status_history_cand']
    n_statuse_change = coldict['n_statuse_change']['name']
    data[n_statuse_change] = n_status_change(dp.force_take_series(data, m_status_history))
    flat_cols = [x for x in coldict.values() if x['coltype'] == stc.COLTYPE_FLAT]
    old_cols_names = [x['name'] for x in coldict.values() if x['coltype'] == stc.COLTYPE_PREPARATION]
    t_time = time.time()
    t_len = 0
    new_rows = list()
    # Iterate over the dataframe rows; every status-history element of a row
    # produces one flat output row.
    for index, row in data.iterrows():
        for j in range(row[n_statuse_change]):
            t_new_value_row = np.empty(shape=[0, 0])
            for k in range(len(flat_cols)):
                if flat_cols[k]['colsubtype'] == stc.COLSUBTYPE_FLATPATH:
                    new_value = dp.under_value_line(
                        row,
                        path_for_status(j, row[n_statuse_change] - 1, flat_cols[k]['path'])
                    )
                    # Additionally process the date fields
                    if flat_cols[k]['name'] == coldict['status_set_at']['name']:
                        new_value = dp.iso_date_to_datetime(new_value)
                    if flat_cols[k]['name'] == coldict['status_set_at_mil']['name']:
                        new_value = dp.iso_date_to_miliseconds(new_value)
                    if flat_cols[k]['name'] == coldict['status_stage_order']['name']:
                        try:
                            new_value = int(new_value)
                        except (TypeError, ValueError):
                            pass
                else:
                    if flat_cols[k]['name'] == coldict['status_index']['name']:
                        new_value = j
                t_new_value_row = np.append(t_new_value_row, dp.some_to_null(new_value))
            new_rows.append(np.append(row[old_cols_names].values, t_new_value_row))
    pdb.set_trace()
    res = pd.DataFrame(new_rows, columns=[
        x['name'] for x in coldict.values()
        if x['coltype'] == stc.COLTYPE_FLAT or x['coltype'] == stc.COLTYPE_PREPARATION
    ])
    return res
```
It takes the values from the list of dicts, prepares them to meet ClickHouse's requirements using NumPy arrays, and then appends them all together to get a new dataframe with the target values and column names.
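In simplified form, the pattern is roughly the sketch below (with hypothetical column names like `candidate_id` and `StatusHistory` instead of my real `coldict`/`dp` machinery): accumulate one flat record per status-history element and build the DataFrame once at the end.

```python
import pandas as pd

def flatten_status_history(df):
    # Hypothetical, simplified illustration of the same pattern; the column
    # names here are made up for the example.
    new_rows = []
    for _, row in df.iterrows():                      # row-wise iteration
        for j, status in enumerate(row['StatusHistory']):
            new_rows.append({
                'candidate_id': row['candidate_id'],  # "preparation" column
                'status_index': j,
                'status_name': status['Name'],
                'status_set_at': status['SetAt'],
                'stage_order': status['Stage']['Order'],
            })
    # Build the DataFrame once at the end instead of appending piecewise
    return pd.DataFrame(new_rows)
```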
I've noticed that when the nested structure is big enough, the function becomes much slower. I've found an article that compares different methods of iterating over data in Python: article
It claims that iterating with the .apply() method is much faster, and that vectorization is faster still. But the samples given are pretty trivial and rely on applying the same function to all of the values. Is it possible to iterate over a pandas object in a faster manner, while using a variety of functions on different types of data?
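To illustrate what I mean by "trivial samples", the comparisons in such articles usually look like the snippet below, where the same arithmetic is applied to every row, which doesn't map cleanly onto my per-column, type-dependent processing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 2), columns=['a', 'b'])

# 1. Row-wise iteration (slowest)
total = [row['a'] + row['b'] for _, row in df.iterrows()]

# 2. .apply() over rows (faster, but still a Python-level loop)
total = df.apply(lambda row: row['a'] + row['b'], axis=1)

# 3. Vectorized (fastest) -- the same operation on every value at once
total = df['a'] + df['b']
```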