A while back, I found myself deep in the trenches of prepping a finance dataset for a performance dashboard. The data, sourced from ExactonLine, was neatly organized by projects, each with its own start and end date. The task seemed simple enough: provide a month-by-month breakdown of each project’s lifecycle, incorporating projections, work in progress, invoices, and costs. No big deal, right?
Naturally, my first instinct was to reach for the ‘trusty’ recursive CTE—after all, it’s the Swiss Army knife of SQL. It worked like a charm at first, slicing through the problem, delivering those monthly breakdowns. But as the dataset grew, so did my headaches. The queries became tangled, performance took a nosedive, and debugging turned into quite a puzzle.
Of course, there’s always more than one way to skin a cat (though no cats were harmed in the making of this dataset), and I soon realized that maybe, just maybe, the recursive CTE wasn’t the knight in shining armour I’d hoped for. So, I began exploring alternative methods—ones that were more efficient, easier to maintain, and didn’t require a detective to figure out. This is the story of how I ditched the recursive approach for something more streamlined, robust, and, frankly, less of a headache.
Why Recursive CTEs Were Used?
Initially, the recursive CTE was employed to generate a continuous list of months for each project, starting from the project start date and continuing until the end date. This helped in tracking invoices, calculating cumulative revenue, and forecasting future WIP (Work in Progress).
Why the Change?
While recursive CTEs are functional, they often come with performance challenges, especially when scaling or handling large datasets. They also add complexity to the query and can be difficult to troubleshoot or optimize.
Problems with Recursive CTEs
- Recursive CTEs, while useful for certain problems, have several drawbacks in this context:
- Recursive CTEs can be inefficient in large-scale databases, especially when you have a large number of rows to traverse.
- They are executed iteratively, meaning each recursion step waits for the previous step to complete, which can slow down query execution.
- In my case, generating months for multiple projects could lead to performance degradation as the dataset grows.
- Complexity in Query Logic:
- Recursive CTEs introduce additional complexity in query logic, making the code harder to maintain or optimize.
- When dealing with window functions like cumulative sums, combining them with recursion could make debugging errors or performance bottlenecks more difficult.
- Limited Scalability:
- Recursive CTEs are often harder to scale with increasing data size. They might work well on smaller datasets, but as more projects and months are added, they become harder to manage and slower to execute.
The New Approach: Using Sequence Generators
The logic now relies on Snowflake’s sequence generator using seq4()
combined with a date-spine to generate the month breakdown.
- Using Sequence Generator:
- What It Does?
Instead of recursing through project months, this method generates a series of numbers (row_number()
usingseq4()
function). Each number represents a month offset from the project start date, making the query non-recursive. - Why It’s Better?
This approach is non-recursive and can process much larger datasets faster. It doesn’t rely on one step completing before another can begin, which improves query execution times.
- What It Does?
- Spine Logic:
- Date-Spine Approach: By joining the sequence generator with the date function, I can systematically generate each month’s date within a project’s start and end date range.
- Simpler Logic: Instead of a recursive traversal, I am working with a flat list of months, making the logic much easier to understand, manage, and optimize.
Comparing the Approaches
Recursive CTE:
- Pros
- Easy to implement for small datasets.
- Works well when the data hierarchy is not too deep.
- Cons
- Slower performance as data grows.
- More complex and harder to debug.
- Can create bottlenecks with window functions and cumulative calculations.
Sequence Generator:
- Pros:
- Faster Execution: Handles larger datasets more efficiently by avoiding recursion.
- Scalability: Better suited for scaling as project data increases.
- Simplicity: Easier to understand and maintain.
- Flexibility: You can manipulate the spine and easily adjust date ranges without altering recursive logic.
- Cons:
- Slightly more initial setup to ensure the sequence generation matches the project duration.
A Closer Look: Practical Example
-- Generate month breakdown using seq4()
month_series as (
select
p.id,
p.project_code,
p.account,
p.account_code,
p.account_name,
p.budget_amount,
p.project_start_date,
p.project_end_date,
p.description,
p.type,
p.type_description,
p.classification_description,
dateadd(month, seq.seq, date_trunc('month', p.project_start_date)) as project_month
from projects p
join (
select
row_number() over (order by seq4()) - 1 as seq
from
(select seq4() as seq from table(generator(rowcount => 10000))) as t
) seq
on seq.seq <= datediff(month, p.project_start_date, coalesce(p.project_end_date, last_day(current_date)))
)
Let’s break it down and check the join:
join (
select
row_number() over (order by seq4()) - 1 as seq
from
(select seq4() as seq from table(generator(rowcount => 10000))) as t
) seq
on seq.seq <= datediff(month, p.project_start_date, coalesce(p.project_end_date, last_day(current_date)))
This generates a sequence of numbers (representing months) using row_number()
, which then gets joined to each project based on the number of months between the project start and end dates.
It replaces the need for recursion by simply creating a numerical series that aligns with the project’s timeline. It’s an easy-to-understand, flat logic that removes the need for complex recursive traversal.
Handling Window Functions and WIP Calculation
One of the key challenges was handling Work in Progress (WIP) calculations and distributing values across time. This requires calculating cumulative revenue, costs, and projecting WIP evenly across remaining months for each project.
Using a Sequence Generator made further calculations more efficient. Window functions now work seamlessly across the sequence of months without needing recursive logic, further boosting performance. The simplified logic allowed easier integration of Work in Progress future calculations, and made the breakdown across months more accurate and easier to maintain.
Conclusion: Moving Towards Simplicity and Efficiency
While recursive CTEs may be extremely useful for specific tasks, leveraging built-in Snowflake features like sequence generators can lead to cleaner, more scalable, and robust solutions.
Happy Days!
Thank you for reading this blog.
Also check out our other blogs page to view more blogs on Power BI, Tableau, Alteryx, and Snowflake here.
Work together with one of our consultants and maximize the effects of your data.
Contact us, and we'll help you right away.