
PROJECT
Cyber Tables Python Module
Overview
When 2025 started, I decided to set some structured goals for myself, and among them was to conduct a number of data analyses over the year. To spice things up, I also set the goal of conducting a data analysis in Python without using Pandas. To accomplish this, I realised that I could use the opportunity to create my own pandas – my own data analysis module. Make my own is exactly what I did!
In this project, I’ll take you through my new data analysis module in Python, how it works, the syntax, and where you can download it yourself!
Development
2025-06-21 – First build released and uploaded to GitHub; I also began writing this page.
Next plans:
- Add easier ways to lock and unlock a column data type
- Replace code with functions written after the fact during development
- Add type annotations
- Add more column calculations & aggregations
- Find and iron out bugs with current version
- Write GitHub documentation
Location on GitHub
My scripts are on my public script GitHub:
To use this, simply download it, place it in your site-packages folder, and import it as normal. In the future, I may try to get this legitimately installable via pip. The site-packages folder can be found by importing a normal module and printing its __file__ attribute.
Reading in a CSV and saving back to one
I’ve built this to be as compact as it can be. To open a CSV, you can simply use open_csv(location) and cyber_tables will take care of the rest, including auto-detecting data types, replacing empty entries with “NULL”, and cleaning strings.
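As a flavour of what that cleaning step involves, here's a minimal sketch in plain Python – trimming whitespace and substituting “NULL” for empty entries. This is my own simplified illustration (it skips the data type detection entirely), not the module's actual code:

```python
import csv
import io

def read_rows(csv_text):
    """Read CSV text, trim whitespace, and replace empty entries with "NULL"."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        rows.append([item.strip() if item.strip() != "" else "NULL" for item in row])
    return rows

# Bob has no score, so his entry becomes "NULL"
sample = "name, score\nAlice, 10\nBob,\n"
header, *data = read_rows(sample)
```

The real open_csv builds a full CyberTable object rather than returning plain lists.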
Structure
A table is made up of three classes:
CyberTable – The main object that contains all of the information, such as column indexes and objects, and row indexes and objects
Column – An object for a single column that tracks the index, name and data type of the column
Row – An object for each row, containing an index and a list of items
There is a fourth class: CyberTableGroups.
This class has columns like a normal table, but contains within it a list of table objects, as well as specific functions that let you perform tasks on all tables within the group. You don’t need to use this directly – the cyber table functions use it in the background – but it may be useful manually.
There are 7 data types:
- string
- int
- decimal
- bool
- date
- datetime
- NULL
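To illustrate how a raw string might be matched to one of those seven types, here's a rough sketch. The function name, the order of checks, and the accepted date formats are my assumptions for illustration, not necessarily what cyber_tables actually does:

```python
from datetime import datetime

def detect_type(value: str) -> str:
    """Guess which of the seven data types a raw string value belongs to."""
    if value == "NULL":
        return "NULL"
    if value in ("True", "False"):
        return "bool"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        pass
    # Try datetime before date, since a datetime string contains a date
    for fmt, name in (("%Y-%m-%d %H:%M:%S", "datetime"), ("%Y-%m-%d", "date")):
        try:
            datetime.strptime(value, fmt)
            return name
        except ValueError:
            pass
    return "string"  # fall back to string when nothing else matches
```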
Table information & returning column data
Here are some functions that give you an overview of the data in the table and the structure of the object.
The following four functions will print the table rows from the top, bottom, a random selection, or the entire table.
Print structure will give you an overview of the table object, columns, and the rows.
Print columns will simply print the column indexes, names and data types.
You can also return all data in a column as a list. The default argument will be for the column index, but I’ve built it to accept either the index or the column name. If both are given (why would you do that?) it will prioritise the index.
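The index-over-name priority can be sketched like this – a plain-Python illustration of the lookup logic, with hypothetical parameter names rather than the module's real signature:

```python
def return_column_items(rows, column_names, index=None, name=None):
    """Return one column's values, preferring the index when both are given."""
    if index is None and name is not None:
        index = column_names.index(name)  # resolve the name only as a fallback
    return [row[index] for row in rows]

rows = [["Alice", 10], ["Bob", 7]]
names = ["name", "score"]
```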
Returning sub tables
Sub tables are simply normal cyber tables that have been reduced by column or row count.
You can return a new cyber table object with only the specified columns by index or name. All inputs must be in a list.
You can also return a new cyber table where all filter conditions are found to be true – for example, all rows where column index 0 = True, column index 1 = 0, and column index 2 = “Dragon”. All inputs must be in a list.
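That "all conditions true" filtering boils down to logic like the following sketch (plain lists of rows here, and my own function and parameter names, not the module's):

```python
def filter_rows(rows, conditions):
    """Keep only rows where every (column_index, value) condition holds."""
    return [row for row in rows
            if all(row[i] == value for i, value in conditions)]

rows = [
    [True, 0, "Dragon"],
    [True, 1, "Dragon"],
    [False, 0, "Dragon"],
    [True, 0, "Knight"],
]
# Only the first row satisfies all three conditions at once
matches = filter_rows(rows, [(0, True), (1, 0), (2, "Dragon")])
```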
You can make a duplicate of your current cyber table with return_copy().
Column-specific code
In cyber_tables, all columns are a Column object. There are some functions that you can use to alter, add and remove columns.
Every column has an index which, depending on the function, may be reset – so if you’re adding or removing columns, make sure to print the columns first if you’re using the index rather than the name.
You can remove a column by using the .remove_column() function. It will automatically reset the other column indexes.
There are two ways to add a column with data. You can insert a new column by giving it a name, which will add a new NULL column and add “NULL” to that position in all rows, and then add the data with the update_data_in_column() function. Alternatively, if your data is in a list and ready to go, you can use the insert_column_with_data() function. The auto_analyse argument can be set to False if you don’t want cyber_tables to automatically assign a data type.
Analysing the columns will reset the indexes and analyse all non-null data to figure out the most appropriate data type (unless you’ve locked the data type, or set auto_analyse to False when creating it).
You can change the column data type by entering the desired type as a string using .change_column_data_type(), or lock and unlock a column’s data type with the lock_column_data_type() and unlock_column_data_type() functions.
You can update a column name either by inputting the index or the old column name. You can also reset the indexes manually with the reset_column_indexes() function.
Lastly for this section, I’ve added in some quality of life features for strings and datetimes. ISO 8601 datetimes in the format YYYY-MM-DDTHH:MM:SSZ (for example, 2025-06-22T16:11:12Z) can be automatically converted to a datetime: the Z is removed, the T is replaced with a space, and the result is parsed as a datetime.
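Those three steps – drop the Z, swap the T for a space, parse – look roughly like this in plain Python (my own function name, used here purely as an illustration):

```python
from datetime import datetime

def parse_iso_z(value: str) -> datetime:
    """Convert YYYY-MM-DDTHH:MM:SSZ: drop the Z, swap T for a space, parse."""
    cleaned = value.rstrip("Z").replace("T", " ")
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")

stamp = parse_iso_z("2025-06-22T16:11:12Z")
```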
Strings can be cleaned, which will trim all whitespace, and capital_first_letter = True will capitalise the first letter.
You can also set the column string case with options:
- “lower”
- “upper”
- “title”
- “lower_snake”
- “upper_snake”
- “title_snake”
Lower and upper will set the string to lower case and upper case respectively. Title will upper-case the first letter of the string and any letter following a space, with all other letters lower case. The snake variants do the same as their non-snake counterparts, but replace all spaces with underscores.
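A compact way to picture the six options is that each is a base case plus an optional snake step. This is a plain-Python sketch of that idea, not the module's code (the real version also has to work with "NULL" entries):

```python
def set_case(value: str, option: str) -> str:
    """Apply one of the six case options: lower/upper/title, plus _snake variants."""
    base = option.replace("_snake", "")
    if base == "lower":
        value = value.lower()
    elif base == "upper":
        value = value.upper()
    elif base == "title":
        value = value.title()
    if option.endswith("_snake"):
        value = value.replace(" ", "_")  # snake variants swap spaces for underscores
    return value
```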
Row-specific code
In cyber_tables, rows are also objects of the Row() type which contain a record of the row index, and the list of items in the row.
You can reset the row indexes with the reset_row_indexes() function. I’m not sure you’d ever have to do this, as cyber_tables should reset the indexes whenever you remove rows anyway.
You can remove rows based on the value in a specific column by using the .remove_rows_by_column_value() method.
You can also remove a row by specifying the row index. I don’t think there is much of a need for this, but I included it regardless.
You can return the row items based on the index using the .return_row_items_by_index() function, and return all rows as a list of lists using the .return_rows_as_lists() function.
You can return all distinct values in a column using the .return_distinct_column_values() function and giving it either an index or a column name.
You can remove duplicate rows by using the .remove_duplicate_rows() function, or remove duplicates based on a list of columns, by either providing a list of indexes or a list of column names. In both cases, the functions will return the number of removals, and the indexes of those removed rows.
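The duplicate-removal behaviour – returning both the number of removals and the removed indexes – can be sketched like this (plain lists and my own parameter names, as an illustration only):

```python
def remove_duplicate_rows(rows, column_indexes=None):
    """Drop duplicate rows, optionally comparing only the given columns.

    Returns the surviving rows, the number removed, and the removed row indexes.
    """
    seen = set()
    kept, removed_indexes = [], []
    for i, row in enumerate(rows):
        # Compare the whole row, or just the chosen columns
        key = tuple(row) if column_indexes is None else tuple(row[c] for c in column_indexes)
        if key in seen:
            removed_indexes.append(i)
        else:
            seen.add(key)
            kept.append(row)
    return kept, len(removed_indexes), removed_indexes

rows = [["a", 1], ["b", 2], ["a", 1], ["a", 3]]
kept, count, removed = remove_duplicate_rows(rows)
```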
You can also use .return_table_by_nulls_in_column() to return a new cyber table containing only the rows where the specified column (by index or name) is “NULL”.
Lastly for this section, you can sort the rows ascending or descending based on a column index or name.
Table Calculations
Table calculations are functions that (usually) return a single value based on the calculation.
You can use .return_sum() inputting the index or column name to get a sum of all items in that column. Sum will only work on int or decimal data.
return_min_value(), return_max_value() and return_range() will return the smallest, largest, or the difference between largest and smallest value in the column respectively. The neat thing about these functions is that I designed them to work with date, datetime, bool and string columns too. String values are represented by the numerical length of the string, bools by whether they’re True or False, and datetimes are returned as either the value itself in the case of min or max, or a timedelta in the case of range.
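The trick that makes this work across types is mapping each value to something comparable before taking min/max. Here's a sketch of that idea (my own illustration; the real module's internals may differ):

```python
from datetime import datetime, timedelta

def comparable(value):
    """Map a value to a comparable form: strings by length, bools as 0/1."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, bool):
        return int(value)
    return value  # ints, decimals and datetimes compare natively

def return_range(items):
    """Difference between the largest and smallest comparable values."""
    keys = [comparable(v) for v in items]
    return max(keys) - min(keys)
```

Note that subtracting two datetimes naturally produces the timedelta described above.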
Mode, median and mean functions will behave very similarly to min, max and range in that they work with almost all data types. Strings will again be represented by the numerical length of each string.
Nulls and non-nulls will return the number of “NULL” or non-“NULL” values in a column.
Variance and standard_deviation functions will return those calculations from int and decimal columns.
You can also return a count of how many True or False entries there are in a column. This will only work for a bool data type column.
Calculation columns
I’ve built in a way that you can add new columns and populate the rows under that column with special calculations. They all use the same function. Available options:
- rank – order the table by a column and give rows a rank largest to smallest
- row_number – simply labels all of the rows top to bottom with a row number
- ntile – allows you to order a table by a column and split the data into however many buckets you want to group the data by
- individual_variance – adds data showing the variance of each individual row by the reference column
- individual_std – adds data showing the standard deviation of each individual row by the reference column
- + days – add x number of days onto a date or datetime column
- – days – subtract x number of days from a date or datetime column
- above_threshold_percentage – adds True or False if the reference column exceeds the percentage threshold
- below_threshold_percentage – adds True or False if the reference column does not exceed the percentage threshold
Rank works on almost all data types. It uses Python’s sorted() function on the list of items in the column so while it will technically work on bool data, it probably won’t serve a purpose. Row number takes the table as it currently is.
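Rank's reliance on sorted() can be sketched like this – largest value gets rank 1 (my own simplified illustration, which doesn't define any particular tie-breaking behaviour):

```python
def rank_rows(values):
    """Rank values largest to smallest using Python's sorted()."""
    ordered = sorted(values, reverse=True)
    # Each value's rank is its position in the descending order, 1-based
    return [ordered.index(v) + 1 for v in values]
```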
Ntile calculations will order the table by the reference column and split the rows into buckets as equal in size as possible. calculation_value=3 will create three buckets. The first two buckets should be of equal size, with the last bucket being slightly larger to account for any remainders.
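The bucketing logic – equal-sized buckets with the remainder folded into the last one – looks roughly like this on plain lists (my own sketch, not the module's code):

```python
def ntile(values, buckets):
    """Order values and split into near-equal buckets; remainder goes to the last."""
    ordered = sorted(values)
    size = len(ordered) // buckets
    # All buckets except the last take exactly `size` items
    groups = [ordered[i * size:(i + 1) * size] for i in range(buckets - 1)]
    groups.append(ordered[(buckets - 1) * size:])  # last bucket absorbs the remainder
    return groups
```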
Individual variance and standard_deviation calculations will add a new value in the new column for each row’s variance or standard_deviation by the reference column.
+ days and – days will add or subtract that many days from any date or datetime column.
Above and below threshold percentage will add True or False for each row where the column index is above or below that percentage in relation to the rest of the table. A value of 50 = 50%.
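The exact definition lives in the module, but as a rough, hypothetical illustration, one way such a flag could work is to compare each value against the given percentage of the column's range (this interpretation is my assumption):

```python
def above_threshold_percentage(values, percentage):
    """Flag each value True if it sits above the given percentage of the column's range.

    NOTE: this is a guess at the semantics, purely for illustration.
    """
    low, high = min(values), max(values)
    cutoff = low + (high - low) * (percentage / 100)
    return [v > cutoff for v in values]
```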
Data Aggregation
Aggregation can be done either on a normal table, or done within a group class, but the outcome is the same. You can mentally visualise this as condensing multiple rows down into one, to calculate something based on a reference column. Each aggregation becomes a new column. Aggregation options:
- sum
- mean
- median
- mode
- min
- max
- range
- nulls
- non_nulls
- row_counts
- standard_deviation
- variance
- true_percentage
- false_percentage
Aggregation statements are slightly more complex than the others, but what they all effectively do is:
- Create a table for all distinct values in the reference column(s)
- Create a CyberTableGroups object and add all tables to it
- Perform all calculations to all calculation columns in all tables
- Condense all tables to a single line with only the distinct reference columns and the new calculation columns
- Merge all single-line tables back into a CyberTable and return that at the end
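The shape of that pipeline – split by distinct reference values, aggregate each group to one line, merge the lines back – can be sketched in plain Python with a single mean aggregation (lists of rows stand in for the table and group objects here; this is my illustration of the idea, not the module's code):

```python
from statistics import mean

def aggregate(rows, group_index, value_index):
    """Split rows by distinct values in one column, average another, merge back."""
    # Step 1: one group of rows per distinct reference value
    groups = {}
    for row in rows:
        groups.setdefault(row[group_index], []).append(row)
    # Steps 2-5: condense each group to a single line, then merge the lines
    return [[key, mean(r[value_index] for r in group)]
            for key, group in groups.items()]

rows = [["red", 10], ["blue", 4], ["red", 20]]
result = aggregate(rows, 0, 1)
```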
When performing an aggregation on a CyberTable, you will receive a new table back with only the reference columns and the calculations applied to the calculation columns.
Min, max and range work as they do in the calculation column statements earlier on this page.
Mean, median and mode also work in the same way, allowing you to, for example, perform a median calculation on a string column. Calculations will always ignore NULL values.
You can also simply sum a column.
Nulls, non-nulls and row counts work as they do earlier. The benefit of row_counts in this context is that you can see how many rows were in the group regardless of whether it contained any NULL values.
Variance and standard deviation here, unlike earlier, are calculated across all values in the group rather than for each individual row.
True_percentage and false_percentage will return the percentage of the column in a bool calculation column that contained True or False values respectively, ignoring any NULL values.
CyberTableGroups
CyberTableGroups is an object that the normal .aggregate() function within a CyberTable object will utilise, but you can also access and use this class manually should you want to.
To return a CyberTableGroups object, simply use the .return_groups() function on a table and specify the indexes of the columns to group the table by.
To merge a group back into a table, use the .merge_into_cyber_table() function on a group object.
You can iterate through all tables in a group and print the top, bottom, or a random selection of rows using the group functions .top(), .bottom() and .random_selection() just like a table can.
You can add calculations to the tables within a group all at once using the .add_batch_row_calculations() function and format it identically to the normal calculation column code of a CyberTable.
The group object also has a .aggregate() function that works exactly as it would do in a CyberTable and returns a CyberTable with the calculated data in it.
Summary
In summary, this was one hell of an enjoyable and worthwhile project. I genuinely want to both keep building this into the future and use it in my own data analysis. Having total freedom to implement and change what I like is a huge opportunity to make something special. I’ll continue to develop this as time goes on and make it even better!