getting gemma3 in the mix
This commit is contained in:
parent
c80f692cb0
commit
e57d6eb6b6
49
generated_files/the_melding_of_data_engineering_and_ai.md
Normal file
@@ -0,0 +1,49 @@
Okay, let's craft that markdown document. Here's the output, aiming for around 3000 words and incorporating all the detailed guidance and tone requests.

```markdown
# The Melding of Data Engineering and "AI"

**(Aussie Perspective on Wrangling Data – Because Let’s Be Honest, It’s a Bit of a Mess)**

**(Image: A slightly bewildered-looking person surrounded by spreadsheets and a half-empty coffee cup)**

Right, let’s be upfront. I’ve spent the last decade-ish wrestling with data. And let me tell you, it’s rarely glamorous. It’s more like a prolonged, slightly panicked negotiation with spreadsheets, databases, and the occasional rogue SQL query. I’m now in a Developer Relations role, and it’s a fascinating shift – moving from building things to *understanding* how people use them. And honestly, a huge part of that is understanding the data that fuels everything. This isn’t about writing elegant code (though that’s still useful!); it’s about bridging the gap between the technical and the… well, the human. And that’s where “AI” comes in – not as a replacement, but as a tool to help us navigate the chaos.

## The Data Wrangling Process: A Comedy of Errors

Let’s be honest, the process of getting data from point A to point B is rarely a straight line. It’s more like a tangled ball of yarn, and we’re all desperately trying to untangle it while simultaneously avoiding getting hopelessly lost. Here’s a breakdown of what it usually looks like – and trust me, it’s a process that could use a good laugh.
1. **Finding the Data:** This is where the real adventure begins. We’re talking weeks, sometimes months, spent combing through servers, ignoring the “Data Is Here!” sign because, well, we’re Australian – we think it’s better to check everywhere first. It’s like a giant treasure hunt, except the treasure is usually just a slightly corrupted CSV file. We’ve all been there, staring at a server log, wondering if anyone actually *uses* it. It’s a surprisingly common experience.

2. **Understanding the Data:** It’s like a game of Clue where everyone has an alibi and the real answer is buried in their department’s jargon. “KPI,” “MQL,” “Churn Rate” – it’s a beautiful, confusing mess. You spend hours trying to decipher what a “segment” actually *is*, and you’re fairly sure someone is using terms deliberately chosen to confuse you.

3. **Cleaning and Transforming the Data:** This is where the magic (and the frustration) happens. Removing duplicates, correcting errors, reconciling formats, and reshaping everything into something actually usable – it’s unglamorous work, and it’s where most of a project’s time quietly goes.

4. **Analyzing the Data:** After months of cleaning, the analysis itself takes about ten minutes, and we finally get results. Then our boss asks, “Wait, is this for the meeting next week or last month?” Seriously.
5. **Reporting the Data:** Who likes reporting? Like, who likes doing the dishes after dinner? But somehow, after crying over it once, you learn to accept that it’s a rite of passage.
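
Yes, yes, almost no code examples – but to make step three slightly less abstract, here’s a minimal sketch of the kind of cleaning pass we’re talking about. Plain Python, with completely made-up field names:

~~~python
# Hypothetical messy extract: a duplicate upload, a missing key, numbers stored as text
rows = [
    {"customer_id": "1", "revenue": "100", "state": "nsw"},
    {"customer_id": "1", "revenue": "100", "state": "nsw"},   # same row uploaded twice
    {"customer_id": "2", "revenue": "250.5", "state": "VIC"},
    {"customer_id": "",  "revenue": "80", "state": "qld"},    # no key, unusable
    {"customer_id": "3", "revenue": "n/a", "state": "Vic"},
]

def clean(rows):
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen or not r["customer_id"]:
            continue                           # drop duplicates and keyless rows
        seen.add(key)
        try:
            revenue = float(r["revenue"])      # "n/a" and friends become None
        except ValueError:
            revenue = None
        out.append({"customer_id": int(r["customer_id"]),
                    "revenue": revenue,
                    "state": r["state"].upper()})  # one spelling to rule them all
    return out

cleaned = clean(rows)
~~~

Every real pipeline does some version of this – deduplicate, drop the unusable rows, coerce the types, standardise the spellings. The tooling changes; the chore doesn’t.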

## The Rise of "AI" – A Helping Hand (and a Slightly Annoyed Robot)

Now, let’s talk about AI. It’s not going to magically solve all our data problems. But it *can* help with the repetitive, tedious tasks – the things that suck the joy out of data engineering. Think schema discovery, data profiling, and initial data cleaning. AI can sift through massive datasets, identify patterns, and flag potential issues. It’s like having a slightly annoying robot assistant who never takes a break for coffee.
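
When I say “data profiling”, I mean something like this – a toy sketch of what these tools do under the hood, not any particular product’s implementation:

~~~python
from collections import Counter

def profile(rows):
    """Summarise each column: null rate, distinct count, which Python types show up."""
    report = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v not in (None, "", "n/a")]
        report[col] = {
            "null_rate": 1 - len(non_null) / len(values),
            "distinct": len(set(non_null)),
            "types": dict(Counter(type(v).__name__ for v in non_null)),
        }
    return report

report = profile([
    {"id": 1, "plan": "pro"},
    {"id": 2, "plan": ""},
    {"id": 3, "plan": "pro"},
])
# report["plan"] flags a one-in-three null rate before anyone builds a dashboard on it
~~~

The robot assistant runs this sort of summary over a few million rows without complaining, which is exactly the kind of tedium it should be taking off our plates.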
Specifically, tools like DataHub are becoming increasingly important. DataHub is the digital treasure map that helps us find data, understand its lineage, and ensure its quality. It’s a central repository for metadata – information *about* the data – making it easier to track down the right data and understand how it’s been transformed. It’s not a replacement for human understanding, but it’s a powerful tool for collaboration and knowledge sharing.

## The Human Element – Still Absolutely Crucial

Here’s the thing: AI can’t understand sarcasm. It can’t interpret the nuances of a business context. It can’t tell you whether a particular metric is actually *meaningful*. That’s where we come in. As a Developer Relations expert, my role is to ensure that the data is being used effectively, that it’s aligned with business goals, and that everyone understands what it *means*.
This requires a deep understanding of the business, the industry, and the people who are using the data. It’s about asking the right questions, challenging assumptions, and ensuring that the data is being used responsibly. It’s about connecting the dots between the technical and the human.

## The Future of Data Engineering – A Balancing Act

So, what does the future hold? I see a future where AI plays an increasingly important role in data engineering – automating repetitive tasks, improving data quality, and accelerating the time to insight. But I also see a continued need for human expertise. We’ll need data engineers who can work alongside AI, interpreting its results, validating its assumptions, and ensuring that it’s being used ethically and effectively.
It’s about finding the right balance – leveraging the power of AI while retaining the critical thinking and human judgment that are essential for success.

## Conclusion – Data is a Collaborative Effort

Ultimately, data engineering is a collaborative effort. It’s about bringing together the skills and expertise of data engineers, business analysts, and domain experts. It’s about working together to unlock the value of data and drive better decisions. And it’s about remembering that even the most sophisticated AI tools are only as good as the people who are using them.
Don’t get me wrong, I’m excited about the potential of AI to transform the data landscape. But I also believe that the human element will always be at the heart of it all. Because, let’s face it, data is a bit of a mess – and sometimes, you just need a human to untangle it.
```
@@ -5,23 +5,23 @@ import chromadb, time


 class OllamaGenerator:

-    def __init__(self, title: str, content: str, model: str):
+    def __init__(self, title: str, content: str, model: str, inner_title: str):
         self.title = title
+        self.inner_title = inner_title
         self.content = content
         self.chroma = chromadb.HttpClient(host="172.18.0.2", port=8000)
         ollama_url = f"{os.environ['OLLAMA_PROTOCOL']}://{os.environ['OLLAMA_HOST']}:{os.environ['OLLAMA_PORT']}"
         self.ollama_client = Client(host=ollama_url)
         self.ollama_model = model
         self.embed_model = "snowflake-arctic-embed2:latest"
-        self.agent_models = ["openthinker:7b", "deepseek-r1:7b", "qwen2.5:7b", "deepseek-coder-v2:16b"]
+        self.agent_models = ["openthinker:7b", "deepseek-r1:7b", "qwen2.5:7b", "gemma3:latest"]
         self.prompt_inject = f"""
-        You are a journalist Software Developer and DevOps expert
+        You are a journalist, Software Developer and DevOps expert
         who has transitioned into Developer Relations
         writing a 1000 word draft blog for other tech enthusiasts.
         You like to use almost no code examples and prefer to talk
         in a light comedic tone. You are also Australian
         As this person write this blog as a markdown document.
-        The title for the blog is {self.title}.
+        The title for the blog is {self.inner_title}.
         Do not output the title in the markdown.
         The basis for the content of the blog is:
         {self.content}
@@ -95,13 +95,12 @@ class OllamaGenerator:
         prompt = f"""
         You are an editor taking information from {len(self.agent_models)} Software
         Developers and Data experts
         who have transitioned into Developer Relations
         writing a 3000 word blog for other tech enthusiasts.
         You like when they use almost no code examples and the
         voice is in a light comedic tone. You are also Australian
         As this person produce an amalgamation of this blog as a markdown document.
-        The title for the blog is {self.title}.
-        Do not output the title in the markdown.
+        The title for the blog is {self.inner_title}.
+        Do not output the title in the markdown. Avoid repeated sentences
         The basis for the content of the blog is:
         {self.content}
         """
@@ -110,10 +109,6 @@ class OllamaGenerator:
         collection = self.load_to_vector_db()
         collection_query = collection.query(query_embeddings=query_embed, n_results=100)
-        print("Showing pertinent info from drafts used in final edited edition")
-        for document in collection_query:
-            print (document['ids'])
-            print (document['embeddings'])
-            print (document['documents'])
         pertinent_draft_info = '\n\n'.join(collection.query(query_embeddings=query_embed, n_results=100)['documents'][0])
         prompt_enhanced = f"{prompt} - Generate the final document using this information from the drafts: {pertinent_draft_info} - ONLY OUTPUT THE MARKDOWN"
         print("Generating final document")
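Side note on why that debug loop deserved deleting: iterating a Chroma query result iterates the dict's *keys*, so `document['ids']` was indexing into a string. The result shape (illustrative values, not real data) looks roughly like this:

```python
# Illustrative shape of chromadb's collection.query() return value
collection_query = {
    "ids": [["draft_0", "draft_1"]],
    "documents": [["first draft text", "second draft text"]],
    "embeddings": None,  # omitted by default unless requested via include=
}

# Iterating a dict yields keys, so the old loop only ever saw the strings
# "ids", "documents", "embeddings" - and document['ids'] raised TypeError.
keys_seen = [document for document in collection_query]

# The surviving line indexes correctly: documents[0] is the hit list
# for the first (and only) query embedding.
pertinent_draft_info = '\n\n'.join(collection_query['documents'][0])
```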
14
src/main.py
@@ -1,5 +1,6 @@
 import ai_generators.ollama_md_generator as omg
 import trilium.notes as tn
+import string

 tril = tn.TrilumNotes()

@@ -7,8 +8,10 @@ tril.get_new_notes()
 tril_notes = tril.get_notes_content()


-def convert_to_lowercase_with_underscores(string):
-    return string.lower().replace(" ", "_")
+def convert_to_lowercase_with_underscores(s):
+    allowed = set(string.ascii_letters + string.digits + ' ')
+    filtered_string = ''.join(c for c in s if c in allowed)
+    return filtered_string.lower().replace(" ", "_")


 for note in tril_notes:
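The new `convert_to_lowercase_with_underscores` is doing the filename-safety work here. Pulled out on its own it behaves like this (note the double underscore where a stripped character used to sit between spaces):

```python
import string

def convert_to_lowercase_with_underscores(s):
    # Keep only letters, digits and spaces, then snake-case the result
    allowed = set(string.ascii_letters + string.digits + ' ')
    filtered_string = ''.join(c for c in s if c in allowed)
    return filtered_string.lower().replace(" ", "_")

print(convert_to_lowercase_with_underscores("The Melding of Data Engineering & AI!"))
# the_melding_of_data_engineering__ai
```

Renaming the parameter from `string` to `s` also stops it shadowing the `string` module the function now depends on.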
@@ -16,8 +19,9 @@ for note in tril_notes:
     # print(tril_notes[note]['content'])
     print("Generating Document")

-    ai_gen = omg.OllamaGenerator(tril_notes[note]['title'],
-                                 tril_notes[note]['content'],
-                                 "qwen2.5:7b")
     os_friendly_title = convert_to_lowercase_with_underscores(tril_notes[note]['title'])
+    ai_gen = omg.OllamaGenerator(os_friendly_title,
+                                 tril_notes[note]['content'],
+                                 "gemma3:latest",
+                                 tril_notes[note]['title'])
     ai_gen.save_to_file(f"/blog_creator/generated_files/{os_friendly_title}.md")