In this work, we explore the enhancement of polar codes for channels with memory, focusing on achieving low decoding complexity and optimizing input distributions for maximum transmission rates. Polar codes are known for their efficient decoding, exhibiting a complexity of \( O(N\log N) \) in memoryless channels, and complexity of \( O(|\cS|^3N\log N) \) in \glspl{fsc}, where \( |\cS| \) is the state space size. A notable recent advancement is the integration of \glspl{nn} to create an \gls{npd}, which is adept at learning from data without the knowledge of the channel model, effectively bypassing the cubic complexity growth associated with the channel state size. In this paper, we propose a framework to optimize the input distribution for polar codes, aiming to maximize the mutual information of effective bit channels. This framework has been tested on both memoryless and \glspl{fsc}, including the \gls{awgn} channel and the Ising channel, yielding promising results. The key contribution of this paper is the demonstration of the feasibility of simultaneously selecting an optimal input distribution and creating a practical decoder for various channel types, even in the absence of a channel model. This approach paves the way for new advancements in data-driven communication theory, especially for channels with memory.